Schema Matching Presentation

Automatic Schema Matching
Nicole Oldham
CSCI 8350
(Semantic Web Course @ Univ of Georgia)
Topic Presentation
Outline
• Introduction
• Application Domains
• Classification of Schema Matching Approaches
• Current Work
• MWSAF Matching
• Open Research Directions
• Conclusion
Schema Matching
• Match: Takes two schemas as input and produces a
mapping between the elements that correspond to
each other semantically.
• It is usually performed manually.
- Tedious
- Time consuming
- Error prone
- Expensive
We must automate this process!
Example
• GTE telecommunications needed to
integrate 40 databases with a total of 27,000
elements.
• Project planners estimated that manual
matching would take 12 person-years.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Various Levels of Heterogeneity
ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
How to deal with Semantic
Heterogeneity
1. Standardize: agree on a common representation
2. Translate: create mappings between different schemas
- requires human input and machine reasoning
- mappings can be difficult and expensive
3. Annotate: create relationships between agreed-upon
conceptualizations
- requires human input and machine reasoning
- annotation can be difficult and expensive
ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
Challenges
• The actual semantics of the involved elements are typically known only to the
creators or recorded in documentation, so we must use clues in the schema and data
instead.
• These clues are often misleading.
• E.g., ‘Area’ can refer to different entities.
• E.g., the same entity can have very different names.
• Clues are often ambiguous.
• E.g., ‘Contact-agent’: agent name or phone number?
• The matching process can be very costly.
• Each element of the schema must be examined to ensure discovery of
the best match.
• Matching is often subjective, depending on the application.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Where is Schema Matching
used?
• Database Application Domains
- Data Integration
- Data Warehousing
- E-Business
- Query Processing
• Semantic Web
- XML/HTML to an Ontology
- Semantic Web Services
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Integration
Problem: Construct a global view from a set of
independently constructed schemas.
(e.g., ontologies)
- Different structure and terminologies
Solution: Schema Matching is performed to find
relationships between concepts in each schema.
Then the matching elements can be unified.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Data Warehouses
Problem: Integrating data sources into a data
warehouse.
- Different formats between the source and
warehouse.
Solution: Use matching to find the elements of the
source that are also present in the warehouse.
Then the details of the semantics can be examined
to integrate the two.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
E-Commerce
Problem: Message translation.
-Each trading partner uses its own message format.
Solution: A match operation would reduce the amount
of manual work to specify how the formats are
related.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Query Processing
Problem: The terms used in the user’s query may be
different from those in the database.
Solution: Matching is used to map the user-specified
concepts in the query to schema elements.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Need for Data Integration on the
Semantic Web
• Problem: Web documents are not in RDF or any
form suitable for the SW.
• We must annotate them with concepts from
ontologies.
• Solution: Use schema matching to map between
elements represented in OWL and the different
schemas of web documents.
Semantic Web Services
• Problem: Web Services are currently searched for
using keywords.
• We need to annotate the WSDLs with semantic
metadata so that they can be discovered
efficiently.
• WSDLs are in XML, Ontologies in OWL!
• Solution: Use schema matching approaches to map
between the two different schemas.
Term Definitions
• Schema: a set of elements connected by some
structure.
• Mapping: a set of mapping elements, each of
which indicates that certain elements of schema s1
are mapped to certain elements in s2.
• Mapping Expression: Tells how s1 and s2
elements are related.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Example
S1 Elements     S2 Elements
Cust            Customer
C#              CustID
CName           Company
FirstName       Contact
LastName        Phone
A mapping between S1 and S2 might contain these elements:
• Cust.C# = Customer.CustID
• Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
• Cust.CName = Customer.Company
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
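The mapping elements above can be read as small executable expressions. The following is a minimal sketch, assuming an S1 record is a plain dict keyed by the S1 attribute names. The function name `map_cust_to_customer`, the sample values, and the choice of a single space as the concatenation separator are illustrative assumptions, not part of the survey.

```python
def map_cust_to_customer(cust):
    """Apply the three mapping elements from the example to one S1 record."""
    return {
        # Cust.C# = Customer.CustID
        "CustID": cust["C#"],
        # Cust.CName = Customer.Company
        "Company": cust["CName"],
        # Concatenate(Cust.FirstName, Cust.LastName) = Customer.Contact
        # (joining with a space is an assumption made for this sketch)
        "Contact": f'{cust["FirstName"]} {cust["LastName"]}',
    }


# Hypothetical sample record in the S1 layout.
sample = {"C#": 17, "CName": "Acme", "FirstName": "Ada", "LastName": "Lovelace"}
customer = map_cust_to_customer(sample)
```

Customer.Phone has no counterpart in S1, so the sketch leaves it out rather than inventing a value.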
Example
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Classification of Schema Matching
Approaches
• Instance vs Schema: matching approaches can consider instance
data or schema-level information.
• Element vs Structure matching: match can be performed for
individual schema elements or combinations of elements.
• Language vs Constraint: linguistic (names) or constraint-based
(keys and relationships).
• Matching Cardinality: match result may relate one or more
elements of one schema to one or more elements of another.
• Auxiliary Information: matcher relies on other information besides
the input schemas, such as dictionaries, user input, global schemas.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Classification of Schema Matching
Approaches
[Figure: taxonomy of schema matching approaches; leaf items are sample approaches]
Schema Matching Approaches
- Individual Matchers
  - Schema-only
    - Element Level
      - Linguistic: Name Similarity, Description Similarity, Global Namespaces
      - Constraint: Type Similarity, Key Properties
    - Structure Level
      - Constraint: Group Matching
  - Instance/Contents
    - Element Level
      - Linguistic: Word Frequency
      - Constraint: Value Pattern and Ranges
- Combining Matchers
  - Hybrid Matchers
  - Composite Matchers
    - Manual Composition
    - Automatic Composition
Further Criteria:
- Match Cardinality
- Auxiliary information use
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Level Matchers
• Consider schema information instead of instance data:
Name, Description, Data Type, Relationship Types,
Constraints, Structure
• Often produce multiple candidates and estimate a degree of
similarity for each
1. Granularity of match (element level vs structure level)
2. Match Cardinality
3. Linguistic Approaches: Name or Description Matching
4. Constraint-Based Approaches
5. Reusing Schema and Matching Information
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Element-Level
• Element-Level: Identifies all elements of S1 that
are the same or similar to elements of S2.
• The match comparison can be based on name,
description, or data type of the element.
• Example of name-based element-level matching:
Address = CustomerAddress
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Structure-Level
• Structure-Level: Matches combinations of elements that appear together in S1
with combinations of elements that appear together in S2.
• Full Structure Match:
S1 Elements     S2 Elements
Address         CustAddress
  Street          Street
  City            City
  State           USState
  Zip             PostalCode
• Partial Structure Match:
S1 Elements     S2 Elements
AccountOwner    Customer
  Name            Cname
  Address         CAddress
  Birthdate       CPhone
  TaxExempt
• Equivalence Patterns: Can enhance structure matching by considering known
equivalence patterns stored in a library.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Match Cardinality
• One or more S1 elements can match one or
more S2 elements.
• Complex matches
Examples of the four local cardinality cases for individual mapping elements:
Local Match Cardinality    S1 Element(s)      S2 Element(s)         Matching Expression
1:1, element level         Price              Amount                Amount = Price
n:1, element level         Price, Tax         Cost                  Cost = Price*(1+Tax/100)
1:n, element level         Name               FirstName, LastName   FirstName, LastName = Name
n:m, element level         B.Title, B.PuNo,   A.Book, A.Publisher   A.Book, A.Publisher = Select
(also n:1, structure       P.PuNo, P.Name                           B.Title, P.Name From B, P
level)                                                              Where B.PuNo = P.PuNo
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
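The cardinality labels in the table follow directly from how many elements appear on each side of a mapping element. The sketch below shows this under simple assumptions: the helper names `local_cardinality` and `cost` are hypothetical, and `cost` just restates the n:1 expression Cost = Price*(1+Tax/100) from the table.

```python
def local_cardinality(s1_elements, s2_elements):
    """Label a mapping element as 1:1, n:1, 1:n or n:m from its element counts."""
    left = "1" if len(s1_elements) == 1 else "n"
    if len(s2_elements) == 1:
        right = "1"
    else:
        # Use "m" on the right only when the left side is already "n".
        right = "m" if left == "n" else "n"
    return f"{left}:{right}"


def cost(price, tax_percent):
    """n:1 matching expression from the table: Cost = Price*(1+Tax/100)."""
    return price * (1 + tax_percent / 100)


labels = [
    local_cardinality(["Price"], ["Amount"]),
    local_cardinality(["Price", "Tax"], ["Cost"]),
    local_cardinality(["Name"], ["FirstName", "LastName"]),
    local_cardinality(["B.Title", "B.PuNo", "P.PuNo", "P.Name"],
                      ["A.Book", "A.Publisher"]),
]
```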
Complex Matches
• 1:1 matches are bounded by the sizes of the schemas, but the number of
functions for combining attributes in a schema is unbounded.
• Little work has been done on complex matching.
• Some approaches hard-code complex matches into rules.
• Some rely on a domain-specific ontology.
• We need domain knowledge to accurately perform complex matching.
• The best match isn’t always the top match returned by the matcher – so
human involvement is still needed.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Linguistic Approaches
• Language based matchers use names and text (i.e. words or
sentences) to find semantically similar schema elements.
• Name Matching: match elements with similar names
• Description Matching: match comments in the schemas
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Linguistic Approaches:
Name Matching
• Matches schema elements with equal or similar names.
• How similarity is defined:
1. Equality of names
2. Equality of names after stemming, which handles prefixes and suffixes.
3. Equality of synonyms
4. Equality of hypernyms (an SUV is a type of car)
5. Similarity of names based on common substrings, soundex,
pronunciation (ShipTo = Ship2)
6. User provided name matches.
• Can be element or structure-level.
• Cardinality is not limited to 1:1.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
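A few of the similarity criteria listed above can be sketched in a single function. This is only an illustration under stated assumptions: the `SYNONYMS` table is a hypothetical stand-in for a real thesaurus, the 0.9 synonym score is arbitrary, and `difflib.SequenceMatcher` from the Python standard library is used as one possible common-substring measure, not the one prescribed by the survey.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table; a real matcher would consult a thesaurus.
SYNONYMS = {
    frozenset({"car", "automobile"}),
    frozenset({"zip", "postalcode"}),
}


def name_similarity(a, b):
    """Return a score in [0, 1] for two element names."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0                      # criterion 1: equality of names
    if frozenset({a, b}) in SYNONYMS:
        return 0.9                      # criterion 3: synonyms (arbitrary score)
    # criterion 5: similarity based on common substrings
    return SequenceMatcher(None, a, b).ratio()
```

With this sketch, ShipTo and Ship2 score well above unrelated names because they share the substring "ship", while stemming, hypernyms, soundex and user-provided matches are left out for brevity.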
Linguistic Approaches:
Description Matching
• Schemas can contain comments in natural language that express the
intended semantics of the schema elements.
• Example:
S1: empn   // employee name
S2: name   // name of employee
• Can be as simple as keyword extraction and synonym matching, or as
complex as using natural language understanding technology.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
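The simple end of that spectrum, keyword extraction, can be sketched as follows. The stop-word list and the use of Jaccard overlap are assumptions made for this example; the survey does not prescribe a particular measure.

```python
# Hypothetical stop-word list; real systems use a much larger one.
STOP_WORDS = {"of", "the", "a", "an"}


def keywords(comment):
    """Extract lower-cased keywords from a schema comment."""
    return {w for w in comment.lower().split() if w not in STOP_WORDS}


def description_similarity(c1, c2):
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(c1), keywords(c2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The two comments from the example: empn and name.
score = description_similarity("employee name", "name of employee")
```

Here the element names empn and name differ, but their comments reduce to the same keyword set, so the description matcher would propose them as a match.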
Constraint Based
• Schemas often contain constraints to define data types
and value ranges, optionality, relationship types,
cardinalities, etc.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Reusing Schema and Mapping
Information
• The effectiveness of matching can be improved with the reuse of
common schema components and previously determined mappings.
• Many schemas are very similar to each other and to previously
matched schemas.
E.g., in e-commerce, substructures often repeat within
different message formats (address fields, name fields)
• A schema library should be created and the schema editors should
access the library to use predefined terms and definitions.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Schema Mapping Reuse
• Example
Schema S1            Schema S
Purchase-order       Purchase-order
  Product              Product
  BillTo               BillTo
    Name                 Name
    Address              Address
  ShipTo               ShipTo
    Name                 Name
    Address              Address
  ContactPhone         Contact
                         Name
                         Address
Schema S2
POrder
  Article
  Payee
  BillAddress
  Recipient
  ShipAddress
• Problems:
1. Determining which part of a new schema is similar to some part
of a previously matched one is a match problem itself.
2. Similarity values may depend on the domain. E.g., Salary and
Income may be identical in a payroll application but not in a tax
reporting application.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Instance Level Approaches
• Why?
1. Little or no schema information available.
2. Enhancement of schema-level matchers. Instance data gives insight to
the contents and meaning of schema elements.
3. To match instance-level data.
• How?
1. Preferred Method: Linguistic Characterization
2. Constraint-based Characterization
e.g., value ranges
3. Auxiliary Information
4. Also uses both rule-based and learner-based techniques.
• Main Problem: When comparing data at the instance level, there are likely
to be very many possible match combinations, many of which are
irrelevant.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Rule Based Solutions
• Rule-Based: hand crafted rules to exploit
schema information
• element names, data types, structures and
subelements.
• E.g., two elements match if they have the same
name and the same number of subelements
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
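The example rule above is simple enough to write down directly. This is a minimal sketch, assuming a schema element is represented by a hypothetical `Element` class holding a name and a list of subelements.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical schema element: a name plus its subelements."""
    name: str
    children: list = field(default_factory=list)


def rule_match(e1, e2):
    """Hand-crafted rule: same name and same number of subelements."""
    return e1.name == e2.name and len(e1.children) == len(e2.children)


addr1 = Element("Address", [Element("Street"), Element("City")])
addr2 = Element("Address", [Element("Street"), Element("Zip")])
addr3 = Element("Address", [Element("Street")])
```

The rule accepts `addr1` and `addr2` even though their subelements differ, which illustrates why hand-crafted rules can be cheap to apply but imprecise.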
Learner Based Solutions
• Learner-Based: exploit both schema and data.
• Requires a lot of training data but can exploit data.
• Rule and learner based techniques combined
provide an effective matching solution.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Combining Different Matchers
• The ideal matching system must exploit many different types of
information and techniques for maximum accuracy.
• More match candidates will be produced if the previous approaches are
combined.
• Two Combination Methods:
1. Hybrid: integrates multiple matching criteria.
Better performance.
2. Composite: combine the results of independently executed matchers.
More flexible.
Can be done automatically or manually.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
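A composite matcher can be sketched as a weighted combination of independently executed matchers. The two toy matchers, the weights, and the names `exact_name`, `shared_prefix` and `composite_score` are all assumptions made for this illustration; real systems combine much richer matchers.

```python
def exact_name(a, b):
    """Toy matcher: 1.0 when the names are equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def shared_prefix(a, b):
    """Toy matcher: length of the common prefix over the longer name."""
    a, b = a.lower(), b.lower()
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / max(len(a), len(b), 1)


def composite_score(a, b, weighted_matchers):
    """Run each matcher independently, then combine results by weighted mean."""
    total = sum(w for _, w in weighted_matchers)
    return sum(w * m(a, b) for m, w in weighted_matchers) / total


# Manual composition: the user picks the matchers and their weights.
MATCHERS = [(exact_name, 0.5), (shared_prefix, 0.5)]
```

Because each matcher runs on its own, matchers can be added, removed or reweighted without touching the others, which is the flexibility the composite approach offers over a hybrid matcher.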
LSD (Univ. of Washington)
• Learning Source Descriptions
• Uses machine learning techniques to match a new data source against a
previously determined global schema.
• Uses a name matcher and several instance-level matchers.
• System is trained with sample user inputs and it learns patterns and
matching rules.
• Mostly instance-oriented but can use schema information too.
• Also supports user input domain constraints on the global schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
SKAT (Stanford University)
• Semantic Knowledge Articulation Tool
• Follows a rule-based approach to semi-automatically determine
matches between two ontologies.
• User input required:
* The user must provide application specific match/mismatch
relations.
* The user must approve or reject matches.
• SKAT matching is used within the ONION architecture for
ontology integration.
• In ONION, an “articulation ontology” is constructed from the
rules. Matching is based on is-a relationships between the
articulation ontology and the source ontology.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
TransScm (Tel Aviv University)
• Uses schema matching to derive an automatic data translation between
schema instances.
• Schemas are transformed into labeled graphs.
• Matching is performed node by node (element-level, 1:1) starting at
the top.
• Requires user intervention if no match is found (e.g., to provide a new
rule).
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
DIKE (Univ. of Reggio
Calabria, Univ. of Calabria)
• Compares pairs of objects by their attributes and the is-a relationships
that they are involved in.
• These pairs are given a match score between 0 and 1.
• User must specify synonyms, homonyms, and inclusion properties.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Cupid (Microsoft Research)
• Hybrid matcher
• Element- and structure-level matches.
Phase 1: Linguistic element-level matching.
- categorizes elements based on name, data types, and domains.
- calculates a linguistic similarity coefficient.
Phase 2:
- transform the original schema into a tree then perform a bottom-up
structure matching.
- calculates a similarity value.
- calculates a weighted mean of linguistic and structural similarity of
pairs of elements
Phase 3:
- uses the mean from phase 2 to decide on a mapping.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
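The weighted mean in phase 2 and the decision in phase 3 can be sketched as follows. The weight `w_struct`, the threshold value and the function names are assumptions chosen for illustration; the slides do not give Cupid's actual parameters.

```python
def weighted_similarity(lsim, ssim, w_struct=0.5):
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must lie in [0, 1]")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def decide_mapping(pairs, threshold=0.6):
    """Keep element pairs whose weighted similarity reaches the threshold.

    Each entry of pairs is (s1_element, s2_element, lsim, ssim).
    """
    return [
        (a, b)
        for (a, b, lsim, ssim) in pairs
        if weighted_similarity(lsim, ssim) >= threshold
    ]


# Hypothetical similarity values for two candidate pairs.
candidates = [("Zip", "PostalCode", 0.9, 0.7), ("Zip", "Street", 0.1, 0.3)]
mapping = decide_mapping(candidates)
```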
Clio (IBM Almaden and Univ.
of Toronto)
• Aims at a semi-automatic creation of match mappings
between a given target schema and a new data source
schema.
• Three Components:
Schema Readers: read schema and translate it into an
internal representation.
Correspondence Engine: is used to identify matching parts
of the schemas or databases.
Mapping Generator: generates view definitions to map
data in the source schema to data in the target schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Similarity flooding (Stanford
Univ. and Univ. of Leipzig)
• Graph Matching Algorithm.
• Converts schemas into directed labeled graphs and
determines the matches between corresponding
nodes of the graphs.
• Uses a name matcher to get an initial element-level match that is then
given to the structural matcher.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Delta (Mitre)
• Uses attribute descriptions to determine attribute matches.
• The method is to group the metadata about an attribute into
a text string which is presented as a document. The user is
then presented with other ‘documents’ with matching
attributes and can choose from those.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
Tess (Univ. of Massachusetts,
Amherst)
• System for helping to cope with schema evolution.
• Takes definitions of the old and new schemas and produces a
program that will transform data that conforms to the old
schema into data that conforms to the new schema.
Bernstein P, Rahm E. A survey of approaches to automatic schema matching
MWSAF: Meteor-S Web Service Annotation
Framework
LSDIS Lab, UGA
• What is it?
A tool for semi-automatically marking up web
service descriptions with ontologies.
It helps in describing services semantically and
aids in efficient web service discovery and
composition.
MWSAF Annotation Tool
• Input: WSDL File
1. Individual elements of the WSDL are matched to
concepts in the domain ontologies.
2. The WSDL is classified into a domain.
3. The matches are given to the user to accept or reject.
4. Upon the user’s acceptance, the annotations are
written to the WSDL.
• Output: WSDL File with semantic annotations
MWSAF Architecture
Main Components of the System:
1. Ontology Store: stores the DAML and RDF ontologies
that will be used to annotate the WSDL files. Ontologies
are categorized by domain.
2. Parser Library: consists of the parsers used to generate
the SchemaGraphs.
3. Matcher Library: provides the schema matching algorithms.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF
Schema Graphs
PROBLEM: The difference in expressiveness of XML
Schema and ontology makes it very difficult to match
these two models directly.
MWSAF converts both models to a common
representation format called SchemaGraph.
A SchemaGraph is a set of nodes connected by edges that are
created using conversion functions.
Then it applies a matching algorithm to find the
mappings between them.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF: Meteor-S Web Service Annotation
Framework
XML to SchemaGraph conversion rules
<xsd:complexType name="Direction">
<xsd:sequence>
<xsd:element maxOccurs="1" minOccurs="1"
name="compass" nillable="true"
type="xsd1:DirectionCompass" />
<xsd:element maxOccurs="1" minOccurs="1"
name="degrees" type="xsd:int" />
</xsd:sequence>
</xsd:complexType>
[Figure: SchemaNode representation of XML schema. Labels shown: Direction, compass, degrees, DirectionCompass, hasElement.]
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.
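One way to picture such a conversion rule is to turn each complex type into a node with a hasElement edge to each of its child elements. The sketch below does this with Python's standard `xml.etree.ElementTree`. It is an assumption-laden illustration, not the MWSAF implementation: the namespace URI bound to `xsd`, the function name `complex_type_to_edges`, and the edge representation as triples are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the xsd prefix; the slide does not show the declaration.
XSD_NS = "http://www.w3.org/2001/XMLSchema"

XSD_SNIPPET = """
<xsd:complexType xmlns:xsd="http://www.w3.org/2001/XMLSchema" name="Direction">
  <xsd:sequence>
    <xsd:element maxOccurs="1" minOccurs="1" name="compass"
                 nillable="true" type="xsd1:DirectionCompass" />
    <xsd:element maxOccurs="1" minOccurs="1" name="degrees" type="xsd:int" />
  </xsd:sequence>
</xsd:complexType>
"""


def complex_type_to_edges(xsd_text):
    """Return (parent, 'hasElement', child) triples for one complexType."""
    root = ET.fromstring(xsd_text.strip())
    parent = root.get("name")
    edges = []
    for elem in root.iter(f"{{{XSD_NS}}}element"):
        edges.append((parent, "hasElement", elem.get("name")))
    return edges


edges = complex_type_to_edges(XSD_SNIPPET)
```

The sketch ignores the `type` attributes; a fuller conversion would presumably also link compass to its DirectionCompass type.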
MWSAF: Meteor-S Web Service Annotation Framework
Ontology to SchemaGraph conversion rules
<daml:Class rdf:ID="WindEvent">
<rdfs:comment>Superclass for all events dealing with wind</rdfs:comment>
<rdfs:label>Wind event</rdfs:label>
<rdfs:subClassOf rdf:resource="#WeatherEvent" />
</daml:Class>
<daml:Property rdf:ID="windDirection">
<rdfs:label>Wind direction</rdfs:label>
<rdfs:domain rdf:resource="#WindEvent" />
<rdfs:range rdf:resource = "http://www.w3.org/2000/10/XMLSchema#string" />
</daml:Property>
<daml:Property rdf:ID="windSpeed">
<rdfs:label>Wind speed</rdfs:label>
<rdfs:domain rdf:resource="#WindEvent" />
<rdfs:range rdf:resource="#Speed" />
</daml:Property>
[Figure: SchemaGraph representation of part of ontology. Labels shown: WindEvent, hasProperty, windSpeed, windDirection, Speed.]
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework.
Mapping
• Measures of the Match Score:
-Element Level Match: linguistic similarity of two
concepts based on names. Uses WordNet to check for
synonyms. Abbreviations are also checked.
-Schema Match: structural similarity, sub-concept
similarities.
• The getBestMapping function then looks at the Match
Scores and determines a map set.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
MWSAF Matching Techniques:
ElemMatch
• Name and String Matching algorithms:
-NGram: considers the number of q-grams that the names
have in common.
-CheckSynonym: uses WordNet to find synonyms.
-CheckAbbreviations: uses an abbreviation dictionary.
-TokenMatcher: uses Porter Stemmer tokenization and
substring matching techniques.
• Each algorithm returns a value between 0 and 1. These
values are used in an equation for the final match score.
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
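The NGram idea can be sketched as counting the q-grams two names share and normalizing to the range 0 to 1. The choice of q = 3 and of the Dice coefficient for normalization are assumptions made here; the slides do not state which formula MWSAF uses, nor how the four scores are combined into the final match score.

```python
def qgrams(name, q=3):
    """Set of lower-cased substrings of length q in name."""
    s = name.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def ngram_similarity(a, b, q=3):
    """Dice coefficient over shared q-grams, a value between 0 and 1."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

Names shorter than q characters produce no q-grams and score 0 in this sketch, which is one reason such a measure would be combined with the synonym, abbreviation and token matchers rather than used alone.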
Matching
• Once each WSDL has been compared against all of the
ontologies in the store and a mapping has been created for
each ontology, two measures are derived from the mapping:
-Average Concept Match: tells the user about the degree of
similarity between matched concepts of the WSDL and
ontology.
-Average Service Match: helps to categorize the service.
*We have a machine learning alternative for categorization!
Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework
Current and Future Issues
• User Interaction: minimize user input but maximize impact of the
feedback
• Real World Analysis: can the current matching techniques be used in
real world situations?
• P2P data management
• Mapping Maintenance: what happens when you map between two
schemas and then one changes?
• Developing global schemas (or ontologies) for domains.
• Dealing with inconsistent data values for a schema element.
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
More Issues
• If we require user acceptance for our matches, then what
happens if our matcher returns hundreds or thousands of
matches?
• Is it unrealistic to think that we will eventually perfect our
matchers?
Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
Conclusion
• It is necessary to automate the matching process.
• Schema matching is very difficult and expensive.
• We have looked at a taxonomy and the descriptions of the
existing approaches for matching.
-Schema vs Instance-level
-Element vs Structure-level
-Language and Constraint based matchers.
• We also discussed several implementations of the matching
techniques.
References
• Bernstein P, Rahm E. A survey of approaches to automatic schema matching.
www.research.microsoft.com/~philbe/VLDBJ-Dec2001.pdf
• Doan A, Halevy A. Semantic Integration Research in the Database Community: A
Brief Survey. http://anhai.cs.uiuc.edu/public/db-review14.pdf
• Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation
Framework. POSV-WWW2004.pdf
• Vassilis C. Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH
Semantic Web Integration Middleware (SWIM). Dagstuhl Seminar.
ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf
Questions