Automatic Schema Matching
Nicole Oldham
CSCI 8350 (Semantic Web Course @ Univ of Georgia)
Topic Presentation

Outline
• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion

Schema Matching
• Match: Takes two schemas as input and produces a mapping between the elements that correspond to each other semantically.
• It is usually performed manually, which is:
  - Tedious
  - Time consuming
  - Error prone
  - Expensive
We must automate this process!

Example
• GTE telecommunications needed to integrate 40 databases with a total of 27,000 elements.
• Project planners estimated that manual matching would take 12 person-years.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

Various Levels of Heterogeneity
ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

How to deal with Semantic Heterogeneity
1. Standardize: agree on a common representation
2. Translate: create mappings between different schemas
   - requires human input and machine reasoning
   - mappings can be difficult and expensive
3. Annotate: create relationships between agreed-upon conceptualizations
   - requires human input and machine reasoning
   - annotation can be difficult and expensive
ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Challenges
• The actual semantics of the involved elements are typically known only to the creators or recorded in documentation, so we must use clues in the schema and data instead.
• These clues are often misleading.
  - E.g., ‘Area’ can refer to different entities.
  - E.g., the same entities can have very different names.
• Clues are often ambiguous.
  - E.g., ‘Contact-agent’: agent name or phone number?
• The matching process can be very costly.
  - Each element of the schema must be examined to ensure discovery of the best match.
• Matching is often subjective, depending on the application.
Doan A, Halevy A.
Semantic Integration Research in the Database Community: A Brief Survey.

Where is Schema Matching used?
• Database Application Domains
  - Data Integration
  - Data Warehousing
  - E-Business
  - Query Processing
• Semantic Web
  - XML/HTML to an Ontology
  - Semantic Web Services
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Schema Integration
Problem: Construct a global view from a set of independently constructed schemas (e.g., ontologies).
- Different structures and terminologies
Solution: Schema matching is performed to find relationships between concepts in each schema. Then the matching elements can be unified.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Data Warehouses
Problem: Integrating data sources into a data warehouse.
- Different formats between the source and the warehouse.
Solution: Use matching to find the elements of the source that are also present in the warehouse. Then the details of the semantics can be examined to integrate the two.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

E-Commerce
Problem: Message translation.
- Each trading partner uses its own message format.
Solution: A match operation would reduce the amount of manual work needed to specify how the formats are related.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Query Processing
Problem: The terms used in the user’s query may be different from those in the database.
Solution: Matching is used to map the user-specified concepts in the query to schema elements.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Need for Data Integration on the Semantic Web
• Problem: Web documents are not in RDF or any form suitable for the Semantic Web.
• We must annotate them with concepts from ontologies.
• Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents.

Semantic Web Services
• Problem: Web Services are currently searched for using keywords.
• We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently.
• WSDLs are in XML, ontologies in OWL!
• Solution: Use schema matching approaches to map between the two different schemas.

Term Definitions
• Schema: a set of elements connected by some structure.
• Mapping: a set of mapping elements, each of which indicates that certain elements of schema S1 are mapped to certain elements in S2.
• Mapping Expression: tells how S1 and S2 elements are related.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Example
  S1 Elements      S2 Elements
  Cust             Customer
    C#               CustID
    CName            Company
    FirstName        Contact
    LastName         Phone
A mapping between S1 and S2 might contain these elements:
• Cust.C# = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Example
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

Classification of Schema Matching Approaches
• Instance vs Schema: matching approaches can consider instance data or schema-level information.
• Element vs Structure matching: a match can be performed for individual schema elements or for combinations of elements.
• Language vs Constraint: matchers can be linguistic (names) or constraint-based (keys and relationships).
• Matching Cardinality: a match result may relate one or more elements of one schema to one or more elements of another.
• Auxiliary Information: the matcher relies on other information besides the input schemas, such as dictionaries, user input, or global schemas.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Classification of Schema Matching Approaches
[Figure: classification tree; leaf entries are sample approaches, and "…" marks branches elided on the slide]
Schema Matching Approaches
  Individual Matchers
    Schema-only
      Element Level
        Linguistic: Name Similarity, Description Similarity, Global Namespaces
        Constraint: Type Similarity, Key Properties
      Structure Level
        Constraint: Group Matching
    Instance/Contents
      Element Level
        Linguistic: Word Frequency
        Constraint: Value Pattern and Ranges
  Combining Matchers
    Hybrid Matchers
    Composite Matchers
      Manual Composition
      Automatic Composition
Further Criteria:
  - Match Cardinality
  - Auxiliary information use
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Schema Level Matchers
• Consider schema information instead of instance data: name, description, data type, relationship types, constraints, structure.
• Often produce multiple candidates and estimate a degree of similarity for each.
1. Granularity of match (element level vs structure level)
2. Match Cardinality
3. Linguistic Approaches: Name or Description Matching
4. Constraint-Based Approaches
5. Reusing Schema and Matching Information
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Element-Level
• Element-Level: Identifies all elements of S1 that are the same as or similar to elements of S2.
• The match comparison can be based on the name, description, or data type of the element.
• Example of name-based element-level matching: Address = CustomerAddress
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Structure-Level
• Structure-Level: Matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2.
• Full Structure Match:
    S1 Elements      S2 Elements
    Address          CustAddress
      Street           Street
      City             City
      State            USState
      Zip              PostalCode
• Partial Structure Match:
    S1 Elements      S2 Elements
    AccountOwner     Customer
      Name             Cname
      Address          CAddress
      Birthdate        CPhone
      TaxExempt
• Equivalence Patterns: Structure matching can be enhanced by considering known equivalence patterns stored in a library.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Match Cardinality
• One or more S1 elements can match one or more S2 elements.
• Complex matches
Examples of the four local cardinality cases for individual mapping elements:
  Local Match Cardinalities | S1 Element(s) | S2 Element(s) | Matching Expression
  1:1, element level | Price | Amount | Amount = Price
  n:1, element level | Price, Tax | Cost | Cost = Price*(1+Tax/100)
  1:n, element level | Name | FirstName, LastName | FirstName, LastName = Name
  n:1, structure level (also n:m, element level) | B.Title, B.PuNo, P.PuNo, P.Name | A.Book, A.Publisher | A.Book, A.Publisher = Select B.Title, P.Name From B, P Where B.PuNo = P.PuNo
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Complex Matches
• The number of candidate 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema.
• Only a little work has been done on complex matching.
• Some approaches hard-code complex matches into rules.
• Some rely on a domain-specific ontology.
• We need domain knowledge to perform complex matching accurately.
• The best match isn’t always the top match returned by the matcher, so human involvement is still needed.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

Linguistic Approaches
• Language-based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements.
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas
Bernstein P, Rahm E.
A survey of approaches to automatic schema matching

Linguistic Approaches: Name Matching
• Matches schema elements with equal or similar names.
• How similarity is defined:
  1. Equality of names
  2. Equality of names after stemming, which deals with prefixes and suffixes
  3. Equality of synonyms
  4. Equality of hypernyms (an SUV is a type of car)
  5. Similarity of names based on common substrings, soundex, pronunciation (ShipTo = Ship2)
  6. User-provided name matches
• Can be element- or structure-level.
• Cardinality is not limited to 1:1.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Linguistic Approaches: Description Matching
• Schemas can contain comments in natural language that express the intended semantics of the schema elements.
• Example:
    S1: empn // employee name
    S2: name // name of employee
• Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Constraint Based
• Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Reusing Schema and Mapping Information
• The effectiveness of matching can be improved by reusing common schema components and previously determined mappings.
• Many schemas are very similar to each other and to previously matched schemas. E.g., in E-Commerce, substructures often repeat within different message formats (address fields, name fields).
• A schema library should be created, and schema editors should access the library to use predefined terms and definitions.
Bernstein P, Rahm E.
A survey of approaches to automatic schema matching

Schema Mapping Reuse
• Example:
    Schema S1
      Purchase-order
        Product
        BillTo
          Name
          Address
        ShipTo
          Name
          Address
        ContactPhone
    Schema S
      Purchase-order
        Product
        BillTo
          Name
          Address
        ShipTo
          Name
          Address
        Contact
          Name
          Address
    Schema S2
      POrder
        Article
        Payee
        BillAddress
        Recipient
        ShipAddress
• Problems:
  1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself.
  2. Similarity values may depend on the domain. E.g., salary and income may be identical in a payroll application but not in a tax reporting application.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Instance Level Approaches
• Why?
  1. Little or no schema information is available.
  2. Enhancement of schema-level matchers. Instance data gives insight into the contents and meaning of schema elements.
  3. To match instance-level data.
• How?
  1. Preferred Method: Linguistic Characterization
  2. Constraint-based Characterization, e.g. ranges
  3. Auxiliary Information
  4. Also uses both rule-based and learner-based techniques.
• Main Problem: Comparing data at the instance level is likely to produce a very large number of possible match combinations, many of which are irrelevant.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Rule Based Solutions
• Rule-Based: hand-crafted rules that exploit schema information
  - element names, data types, structures, and subelements.
• E.g., two elements match if they have the same name and the same number of subelements.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

Learner Based Solutions
• Learner-Based: exploit both schema and data.
• Require a lot of training data, but can exploit data instances as well as schema information.
• Rule-based and learner-based techniques combined provide an effective matching solution.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
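
The rule quoted on the Rule Based Solutions slide (two elements match if they have the same name and the same number of subelements) is simple enough to sketch in Python. This is only an illustration: the Element class, rule_match, and match_schemas are hypothetical names, and comparing names case-insensitively is an added assumption, not something the survey prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical schema element: a name plus its direct subelements."""
    name: str
    children: List["Element"] = field(default_factory=list)


def rule_match(a: Element, b: Element) -> bool:
    """Hand-crafted rule: same name and same number of subelements."""
    # Case-insensitive name comparison is an assumption made for this sketch.
    same_name = a.name.lower() == b.name.lower()
    return same_name and len(a.children) == len(b.children)


def match_schemas(s1: List[Element], s2: List[Element]) -> List[Tuple[str, str]]:
    """Return every (S1 name, S2 name) pair for which the rule fires."""
    return [(a.name, b.name) for a in s1 for b in s2 if rule_match(a, b)]


s1 = [Element("Address", [Element("Street"), Element("City")]),
      Element("Phone")]
s2 = [Element("address", [Element("Street"), Element("Town")]),
      Element("Phone", [Element("AreaCode")])]

matches = match_schemas(s1, s2)
print(matches)
```

Here only the Address pair matches: the two Phone elements share a name but differ in subelement count, which illustrates how brittle a single hand-crafted rule can be.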
Combining Different Matchers
• The ideal matching system must exploit many different types of information and techniques for maximum accuracy.
• More match candidates will be produced if the previous approaches are combined.
• Two Combination Methods:
  1. Hybrid: integrates multiple matching criteria. Better performance.
  2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

LSD (Univ. of Washington)
• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a previously determined global schema.
• Uses a name matcher and several instance-level matchers.
• The system is trained with sample user inputs, from which it learns patterns and matching rules.
• Mostly instance-oriented, but can use schema information too.
• Also supports user-supplied domain constraints on the global schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

SKAT (Stanford University)
• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine matches between two ontologies.
• User input required:
  * The user must provide application-specific match/mismatch relations.
  * The user must approve or reject matches.
• SKAT matching is used within the ONION architecture for ontology integration.
• In ONION, an “articulation ontology” is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

TransScm (Tel Aviv University)
• Uses schema matching to derive an automatic data translation between schema instances.
• Schemas are transformed into labeled graphs.
• Matching is performed node by node (element-level, 1:1) starting at the top.
• Requires user intervention if no match is found (e.g., to provide a new rule).
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

DIKE (Univ. of Reggio Calabria, Univ. of Calabria)
• Compares pairs of objects by their attributes and the is-a relationships that they are involved in.
• These pairs are given a match score between 0 and 1.
• The user must specify synonyms, homonyms, and inclusion properties.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Cupid (Microsoft Research)
• Hybrid matcher
• Element- and structure-level matches.
Phase 1: Linguistic element-level matching.
  - categorizes elements based on name, data types, and domains.
  - calculates a linguistic similarity coefficient.
Phase 2:
  - transforms the original schema into a tree, then performs bottom-up structure matching.
  - calculates a similarity value.
  - calculates a weighted mean of the linguistic and structural similarity of pairs of elements.
Phase 3:
  - uses the weighted mean from phase 2 to decide on a mapping.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Clio (IBM Almaden and Univ. of Toronto)
• Aims at semi-automatic creation of match mappings between a given target schema and a new data source schema.
• Three Components:
  - Schema Readers: read a schema and translate it into an internal representation.
  - Correspondence Engine: identifies matching parts of the schemas or databases.
  - Mapping Generator: generates view definitions to map data in the source schema to data in the target schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Similarity flooding (Stanford Univ. and Univ. of Leipzig)
• Graph matching algorithm.
• Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs.
• Uses a name matcher to get an initial element-level match that is then given to the structural matcher.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Delta (Mitre)
• Uses attribute descriptions to determine attribute matches.
• The method is to group the metadata about an attribute into a text string, which is presented as a document. The user is then presented with other ‘documents’ with matching attributes and can choose from those.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

Tess (Univ. of Massachusetts, Amherst)
• System for helping to cope with schema evolution.
• Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching

MWSAF: Meteor-S Web Service Annotation Framework
LSDIS Lab, UGA
• What is it?
  A tool for semi-automatically marking up web service descriptions with ontologies. It helps in describing services semantically and aids in efficient web service discovery and composition.

MWSAF Annotation Tool
• Input: WSDL File
  1. Individual elements of the WSDL are matched to concepts in the domain.
  2. The WSDL is classified into a domain.
  3. The matches are given to the user to accept or reject.
  4. Upon the user’s acceptance, the annotations are written to the WSDL.
• Output: WSDL File with semantic annotations

MWSAF Architecture
Main Components of the System:
1. Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain.
2. Parser Library: consists of the parsers used to generate the SchemaGraphs.
3. Matcher Library: provides the schema matching algorithms.
Patil A, Oundhakar S, Sheth A, Verma K.
METEOR-S Web service Annotation Framework

MWSAF Schema Graphs
PROBLEM: The difference in expressiveness between XML Schema and ontologies makes it very difficult to match these two models directly.
MWSAF converts both models to a common representation format called SchemaGraph. A SchemaGraph is a set of nodes connected by edges that are created using conversion functions. MWSAF then applies a matching algorithm to find the mappings between the two SchemaGraphs.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF: Meteor-S Web Service Annotation Framework
XML to SchemaGraph conversion rules
    <xsd:complexType name="Direction">
      <xsd:sequence>
        <xsd:element maxOccurs="1" minOccurs="1" name="compass" nillable="true" type="xsd1:DirectionCompass" />
        <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
      </xsd:sequence>
    </xsd:complexType>
[Figure: SchemaNode representation of XML schema. Node labels: Direction, compass, degrees, Direction Compass; edge label: hasElement]
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.

MWSAF: Meteor-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules
    <daml:Class rdf:ID="WindEvent">
      <rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
      <rdfs:label>Wind event</rdfs:label>
      <rdfs:subClassOf rdf:resource="#WeatherEvent" />
    </daml:Class>
    <daml:Property rdf:ID="windDirection">
      <rdfs:label>Wind direction</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="http://www.w3.org/2000/10/XMLSchema#string" />
    </daml:Property>
    <daml:Property rdf:ID="windSpeed">
      <rdfs:label>Wind speed</rdfs:label>
      <rdfs:domain rdf:resource="#WindEvent" />
      <rdfs:range rdf:resource="#Speed" />
    </daml:Property>
[Figure: node labels WindEvent, windSpeed, windDirection; edge label: hasProperty]
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.
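
As a rough illustration of an ontology-to-SchemaGraph conversion, the sketch below turns a trimmed copy of the DAML fragment above into hasProperty edges. The slides do not spell out MWSAF's conversion rules, so the single rule used here (each property becomes a node linked from its rdfs:domain class by a hasProperty edge) is an assumption read off the figure labels, and ontology_to_edges is a hypothetical helper, not part of MWSAF.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DAML = "http://www.daml.org/2001/03/daml+oil#"

# Trimmed version of the slide's ontology fragment, wrapped in an rdf:RDF root
# so that the namespace prefixes are declared and the text parses on its own.
ONTOLOGY = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:daml="{DAML}">
  <daml:Class rdf:ID="WindEvent"/>
  <daml:Property rdf:ID="windDirection">
    <rdfs:domain rdf:resource="#WindEvent"/>
  </daml:Property>
  <daml:Property rdf:ID="windSpeed">
    <rdfs:domain rdf:resource="#WindEvent"/>
    <rdfs:range rdf:resource="#Speed"/>
  </daml:Property>
</rdf:RDF>
""".strip()


def ontology_to_edges(text: str) -> List[Tuple[str, str, str]]:
    """Assumed rule: one (class, 'hasProperty', property) edge per property."""
    root = ET.fromstring(text)
    edges = []
    for prop in root.iter(f"{{{DAML}}}Property"):
        domain = prop.find(f"{{{RDFS}}}domain")
        if domain is None:
            continue  # a property without a domain yields no edge in this sketch
        owner = domain.get(f"{{{RDF}}}resource").lstrip("#")
        edges.append((owner, "hasProperty", prop.get(f"{{{RDF}}}ID")))
    return edges


edges = ontology_to_edges(ONTOLOGY)
for edge in edges:
    print(edge)
```

An edge list is the simplest structure that a node-by-node matcher could walk; a fuller conversion would also need rules for rdfs:range and rdfs:subClassOf, which this sketch leaves out.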
[Figure caption: SchemaGraph representation of part of ontology; remaining node label: Speed]

Mapping
• Measures of the Match Score:
  - Element Level Match: linguistic similarity of two concepts based on names. Uses WordNet to check for synonyms. Abbreviations are also checked.
  - Schema Match: structural similarity and sub-concept similarities.
• The getBestMapping function then looks at the Match Scores and determines a map set.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

MWSAF Matching Techniques: ElemMatch
• Name and string matching algorithms:
  - NGram: considers the number of q-grams that the names have in common.
  - CheckSynonym: uses WordNet to find synonyms.
  - CheckAbbreviations: uses an abbreviation dictionary.
  - TokenMatcher: uses Porter Stemmer tokenization and substring matching techniques.
• Each algorithm returns a value between 0 and 1. These values are used in an equation for the final match score.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

Matching
• Once each WSDL has been compared against all of the ontologies in the store and a mapping has been created for each ontology, two measures are derived from the mapping:
  - Average Concept Match: tells the user about the degree of similarity between matched concepts of the WSDL and the ontology.
  - Average Service Match: helps to categorize the service.
* We have a machine learning alternative for categorization!
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework

Current and Future Issues
• User Interaction: minimize user input but maximize the impact of the feedback
• Real World Analysis: can the current matching techniques be used in real world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two schemas and then one changes?
• Developing global schemas (or ontologies) for domains.
• Dealing with inconsistent data values for a schema element.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

More Issues
• If we require user acceptance for our matches, what happens if our matcher returns hundreds or thousands of matches?
• Is it unrealistic to think that we will eventually perfect our matchers?
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

Conclusion
• It is necessary to automate the matching process.
• Schema matching is very difficult and expensive.
• We have looked at a taxonomy and descriptions of the existing approaches for matching.
  - Schema vs Instance-level
  - Element vs Structure-level
  - Language- and Constraint-based matchers
• We also discussed several implementations of the matching techniques.

References
• Bernstein P, Rahm E. A survey of approaches to automatic schema matching. www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework. POSV-WWW2004.pdf
• Christophides V. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar. ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

Questions