iMAP: Discovering Complex Semantic Matches between Database Schemas

ITCS6010
Fall 2008
Anuradha Venkataraman
800556407

Semantic Mapping:

Semantic mapping specifies how data in two different
data sources are related and how to transform data from one
data source to another
It is a 2-step process:
◦ find semantic matches between attributes of two
schemas
 location=concat(city,state)
◦ find mappings to transform data in one schema to
another based on the matches found (SQL queries,
Clio)
 location=select concat(city,state) from S2


Semantic mapping is a fundamental task in data integration (DI) and
other data sharing systems such as data warehousing (DW), EDI,
cooperations and collaborations
At the time the paper was written semantic mapping
was done through a manual process. This was time
consuming and also prone to error.

The majority of prior related work focused on creating 1-1 matches

Complex matches are common in real-world
schemas

1-1 matches - an attribute in the source schema
is matched to another attribute in the target
schema
 name=custname

Complex matching - specifies a combination of
attributes in the source schema that relate to a
combination of attributes in the target schema
 name=concat(fname,lname)
◦ Finding complex matches is more difficult because the
search space is unbounded.


Schema matching in relational schemas - idea can be used
for other types as well
Goal: to provide an interactive design environment where a
human can create mappings between schemas quickly by
using the system’s suggestions.

Create a semi-automatic system to find complex matches
between two schemas.

Find complex matches.

Use domain knowledge, external data and overlap data

Match Generation by using search techniques over the solution space

Uses a number of searchers (search modules); each searcher searches a subset of
the search space.
◦ Text searcher – searches matches for textual target attributes, uses concatenation operator

Beam search is deployed to control the search through the search space

Machine learning, statistics and heuristics are used to evaluate candidate matches.

Diminishing returns principle is used to determine when the search needs to be
terminated

Considers name similarity, domain knowledge, integrity constraints, external data
and overlap data to re-rank and select the best matches

Provides an explanation module to provide the user with reasons for taking
various decisions at various stages

Semi-automatic discovery of complex
matches, combining search through a set
of candidate matches with methods for
evaluating each match

Use of new kinds of domain knowledge
(overlap data and mining external data)

A mechanism for explaining the decisions
made by the matching system
[Figure: iMAP architecture. Target schema T and source schema S → Match generator → Match candidates → Similarity estimator → Similarity matrix → Match selector → 1-1 and complex matches. Also shown: User, Domain knowledge and data, Explanation module.]

Match Generator
◦ 2 input schemas (S - source schema, T - target schema)
◦ For each attribute t in T, generates possible match
candidates by employing various searchers

Similarity Estimator
◦ Computes similarity score for the match candidates
indicating the level of similarity to target attribute t
◦ Outputs similarity score matrix

Match Selector
◦ Selects the best matches based on the similarity score and
domain constraints


The three modules use external data,
domain knowledge and overlap data during
processing to improve accuracy and
efficiency.
The modules also interact with an
explanation module to provide explanations
for the various actions that each module performs



Search over the search space of possible
matches
Search space is extremely large
Uses multi-searcher strategy
◦ Each searcher has a specific purpose
◦ Each searcher searches only a subset of the search
space

Advantages:
◦ Makes the system extensible – new searchers can
be added and integrated
◦ Each searcher considers small , relevant portions of
the search space

Search Strategy:
◦ Uses beam search to control the search process
◦ Evaluates each candidate based on a scoring function
and retains only the top-k matches at each level.

Match Evaluation
◦ Assigns score based on semantic distance between the
candidate match and target attribute.
◦ Uses machine learning, statistics, heuristics.

Termination
◦ Uses diminishing returns principle
◦ Stops when the difference between the best matches in
consecutive levels of iterations is less than a threshold.
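The search strategy, match evaluation and termination rule above can be sketched roughly as follows. This is only an illustrative sketch, not iMAP's implementation: the function names, the default values of `k`, `threshold` and `max_levels`, and the toy exact-match scorer (standing in for the machine learning and statistical scorers) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[str, ...]  # source attributes, in concatenation order


def beam_search(
    attributes: Sequence[str],
    score: Callable[[Candidate], float],
    k: int = 3,
    threshold: float = 0.01,
    max_levels: int = 4,
) -> List[Tuple[float, Candidate]]:
    """Return the k best (score, candidate) pairs found across all levels."""
    # Level 1: each single source attribute is a 1-1 candidate.
    beam = sorted(((score((a,)), (a,)) for a in attributes), reverse=True)[:k]
    best = list(beam)
    for _ in range(max_levels - 1):
        # Expand every retained candidate by one more attribute.
        expanded = {c + (a,) for _, c in beam for a in attributes if a not in c}
        if not expanded:
            break
        next_beam = sorted(((score(c), c) for c in expanded), reverse=True)[:k]
        gain = next_beam[0][0] - beam[0][0]
        best = sorted(best + next_beam, reverse=True)[:k]
        # Diminishing returns: stop once the best score barely improves.
        if gain < threshold:
            break
        beam = next_beam
    return best


def make_exact_match_scorer(rows, target_values):
    """Toy scorer (assumption): fraction of target values a candidate reproduces."""
    def score(candidate: Candidate) -> float:
        produced = {" ".join(row[a] for a in candidate) for row in rows}
        return len(produced & target_values) / len(target_values)
    return score


rows = [
    {"fname": "John", "lname": "Smith", "city": "Seattle"},
    {"fname": "Ann", "lname": "Lee", "city": "Austin"},
]
scorer = make_exact_match_scorer(rows, {"John Smith", "Ann Lee"})
top = beam_search(["fname", "lname", "city"], scorer)
```

Here `top[0]` holds the best-scoring candidate; the search stops at the third level because adding a further attribute no longer improves the best score.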

Target attributes that are numeric
 price, total, age

Restricted to common operations such as add,
subtract, divide, multiply
 total=qty*unitprice

Considers similarity between the value
distributions of target attribute and the complex
match candidate using the Kullback-Leibler
divergence measure.
◦ This method calculates the divergence between two
distributions based on information and probabilistic
theory.
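As a rough illustration of comparing value distributions with Kullback-Leibler divergence, D(P‖Q) = Σ P(i) log(P(i)/Q(i)), the sketch below bins the values of the target attribute and of a candidate expression into histograms and compares them. The binning scheme, the add-one smoothing and all function names are assumptions for this example; the slides do not say how iMAP estimates the distributions.

```python
import math
from collections import Counter


def histogram(values, lo, hi, bins=10):
    """Smoothed probability histogram of values over [lo, hi]."""
    width = (hi - lo) / bins
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    total = len(values) + bins  # add-one smoothing keeps every bin non-zero
    return [(counts[b] + 1) / total for b in range(bins)]


def kl_divergence(p, q):
    """D(P || Q) for two discrete distributions with no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distribution_distance(target_values, candidate_values, bins=10):
    """Lower means the candidate's values look more like the target's."""
    lo = min(min(target_values), min(candidate_values))
    hi = max(max(target_values), max(candidate_values))
    if lo == hi:
        return 0.0
    return kl_divergence(
        histogram(target_values, lo, hi, bins),
        histogram(candidate_values, lo, hi, bins),
    )


# Made-up data: target totals come from different tuples than the source rows.
qty = [1, 2, 3, 4, 5, 6]
unitprice = [10, 20, 10, 20, 10, 20]
total = [12, 38, 33, 75, 55, 110]
d_product = distribution_distance(total, [q * p for q, p in zip(qty, unitprice)])
d_sum = distribution_distance(total, [q + p for q, p in zip(qty, unitprice)])
```

With this data `d_product` comes out smaller than `d_sum`, so total=qty*unitprice would be preferred over total=qty+unitprice. Note that the comparison needs no shared tuples between the two schemas.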

Examines concatenation of text attributes in S
 location=concat(city,state,country)

Uses Naïve Bayes Text Classifier to calculate
score
 This classifier is trained based on the data in the target
schema to learn the target attribute, returns
probability value for each data instance from source
mapping, the average is the score of the mapping.

Starts with 1-1 matches and proceeds with
concatenation using beam search strategy
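The scoring step of the text searcher might look something like the sketch below: a small Naïve Bayes classifier is trained on the target schema's columns (one class per target attribute), and a concatenation candidate is scored by the average probability that its values belong to the target attribute. The class name, whitespace tokenisation and Laplace smoothing are assumptions for the example, not details taken from the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


class ColumnNaiveBayes:
    """Multinomial Naïve Bayes with one class per target attribute."""

    def __init__(self, target_columns: Dict[str, List[str]]):
        self.tokens = {
            name: Counter(tok for v in values for tok in v.lower().split())
            for name, values in target_columns.items()
        }
        self.vocab = set().union(*self.tokens.values())
        n_rows = sum(len(values) for values in target_columns.values())
        self.log_prior = {
            name: math.log(len(values) / n_rows)
            for name, values in target_columns.items()
        }

    def posterior(self, value: str, attr: str) -> float:
        """P(attr | value), normalised over all target attributes."""
        log_scores = {}
        for name, counts in self.tokens.items():
            denom = sum(counts.values()) + len(self.vocab) + 1  # Laplace smoothing
            log_scores[name] = self.log_prior[name] + sum(
                math.log((counts[tok] + 1) / denom) for tok in value.lower().split()
            )
        top = max(log_scores.values())
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return math.exp(log_scores[attr] - top) / norm


def score_concat(clf, rows, candidate: Sequence[str], target_attr: str) -> float:
    """Average posterior over the values produced by concat(candidate)."""
    values = [" ".join(row[a] for a in candidate) for row in rows]
    return sum(clf.posterior(v, target_attr) for v in values) / len(values)


# Made-up target and source data.
clf = ColumnNaiveBayes({
    "location": ["Seattle WA", "Austin TX", "Boston MA"],
    "name": ["John Smith", "Ann Lee", "Bob Ray"],
})
source_rows = [
    {"city": "Seattle", "state": "WA", "fname": "Ann"},
    {"city": "Boston", "state": "TX", "fname": "John"},
]
good = score_concat(clf, source_rows, ("city", "state"), "location")
bad = score_concat(clf, source_rows, ("fname", "state"), "location")
```

In this toy data `good` exceeds `bad`, so location=concat(city,state) would outrank location=concat(fname,state).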

Target attribute t is categorical if the data
instances have fewer than x distinct values
 product-category=product-type

Finds conversions between categorical
attributes
 waterfront=f(near-water)

Uses KL divergence measures to determine
similarity


Relates data of a schema to the schema of the
other
Focuses on binary attributes in target
schema(yes/no)
 mp3support = {yes/no}

Identifies binary attributes in T and searches for
the attribute name in the data of S
◦ description =“…plays mp3,wav…”

Transforms source attribute s into a categorical
attribute of t if it contains more than x instances
of the attribute name of t.
 mp3support = yes if description contains mp3
no if description does not contain mp3
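A minimal sketch of this idea follows. It assumes the keyword to look for (here "mp3") has already been derived from the target attribute name, which the slides do not spell out; the function name and the threshold parameter `x` are likewise illustrative.

```python
from typing import Dict, List


def schema_mismatch_matches(
    keyword: str, source_rows: List[Dict[str, str]], x: int = 1
) -> Dict[str, List[str]]:
    """Map each source attribute that mentions keyword in more than x
    instances to the yes/no column derived from it."""
    matches: Dict[str, List[str]] = {}
    attrs = list(source_rows[0].keys()) if source_rows else []
    for attr in attrs:
        flags = [keyword.lower() in str(row[attr]).lower() for row in source_rows]
        if sum(flags) > x:
            matches[attr] = ["yes" if hit else "no" for hit in flags]
    return matches


# Made-up source data for the mp3support example.
products = [
    {"brand": "Acme", "description": "plays mp3, wav"},
    {"brand": "Bolt", "description": "radio only"},
    {"brand": "Core", "description": "MP3 and CD player"},
]
derived = schema_mismatch_matches("mp3", products)
```

Here only `description` passes the threshold, and `derived["description"]` is the candidate mp3support column for the source tuples.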

Finds matches that are a conversion between
two different types of units
 length-mm=10 * size-cm

Determines physical attributes by looking for
the presence of units in the data or in the
attribute name
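One simple way to picture this searcher is a lookup of unit suffixes in attribute names. The sketch below assumes units appear as a trailing "-unit" suffix and uses a small hand-written table of length units; both are assumptions for illustration, and detecting units inside the data values is not covered.

```python
from typing import Optional

# Length units expressed in millimetres (assumed table for the example).
MM_PER_UNIT = {"mm": 1, "cm": 10, "m": 1000, "km": 1_000_000}


def unit_of(attr: str) -> Optional[str]:
    """Return the unit suffix of an attribute name such as 'size-cm', if any."""
    suffix = attr.rsplit("-", 1)[-1].lower()
    return suffix if suffix in MM_PER_UNIT else None


def conversion_match(target_attr: str, source_attr: str) -> Optional[str]:
    """Build a match like 'length-mm=10 * size-cm', or None if no units found."""
    t_unit, s_unit = unit_of(target_attr), unit_of(source_attr)
    if t_unit is None or s_unit is None:
        return None
    factor = MM_PER_UNIT[s_unit] / MM_PER_UNIT[t_unit]
    return f"{target_attr}={factor:g} * {source_attr}"


match = conversion_match("length-mm", "size-cm")
```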

Finds complex matches for date attributes
 Bdate=bday/bmonth/byear

Uses ontology to identify date or part of date
attributes
 Bdate is ontology Date;bday- ontology day, etc.

Uses ontology to identify relationships to
determine the type of conversion.


Employs scoring techniques that cannot be
used in the searchers for efficiency
reasons.
Uses two modules:
◦ Name-based evaluator – similarity between the
name of the match (concatenation of the attribute names in the complex match) and the target
attribute name
◦ Naïve Bayes evaluator – based on a Naïve Bayes
classifier

Selects the best matches based on the scores
and other domain integrity constraints.
 Name does not contain numbers
 Match name=login_name {avenkat5} is ranked lower

Uses domain knowledge to clean up complex
matches
 Candidate match - total = price*(qty – product_id),
pid=product_id
 Domain constraint – total and pid are not related.
 iMAP drops product_id to get total=price*qty

Directs search process and prunes meaningless candidates early
 Knowledge – fname and location are not related; searcher does not consider
complex matches that have both attributes together

Uses domain knowledge as early as possible to prune the search
space

Types of Domain Knowledge used
 Domain Constraints – present in schema or provided by user.
◦ Used in various phases based on type of constraint
◦ Example: the average of total number of rooms is less than 10, so all matches that evaluate to more than 10 are
ignored. This decision might be postponed to a later stage if evaluation is time consuming.
 Past Complex Matches – uses knowledge gained from previous matching of schemas in similar
domains.
◦ Extracts expression templates for complex operations, which are used by the searchers
◦ Previous match: cost=price*(1 + .50); template – attribute1=variable*(1+constant); matcher looks for
similar expressions in complex mapping.
 External Data – mines properties of attributes from external sources. These properties are used as
domain constraints in various stages.
◦ External data source generally provided by user (domain expert).
◦ Information from mined data – average unit cost does not exceed $1000.




In real-world settings, source and target
schemas usually share some data:
common tuples or data that represent the same
entity.
This overlap data provides important
information on the mappings of the attributes
which can be used.
Special overlap searchers are used when there
is overlap data in source and target schemas.

Overlap Text Searcher
◦ Used instead of text searcher
◦ Uses overlap data to re-evaluate the mappings generated
by the text searcher.
◦ New score based on the fraction of overlap data that the
mapping satisfies.
 Overlap data – John Smith = concat(John,Smith)
 Mapping name=concat(fname,city) does not match overlap
data – score 0; name=concat(fname,lname) matches overlap
data – score 1.
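The overlap-based score described above can be sketched as the fraction of overlapping tuples that a candidate reproduces exactly. The pairing of source tuples with target values, the single-space separator and the function name are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def overlap_score(
    candidate: Sequence[str],
    overlap_pairs: List[Tuple[Dict[str, str], str]],
    sep: str = " ",
) -> float:
    """Fraction of (source tuple, target value) pairs where concat(candidate)
    over the source tuple equals the target value."""
    if not overlap_pairs:
        return 0.0
    hits = sum(
        1
        for source_row, target_value in overlap_pairs
        if sep.join(source_row[a] for a in candidate) == target_value
    )
    return hits / len(overlap_pairs)


# One overlapping entity: the same person appears in both schemas.
pairs = [({"fname": "John", "lname": "Smith", "city": "Seattle"}, "John Smith")]
score_good = overlap_score(("fname", "lname"), pairs)
score_bad = overlap_score(("fname", "city"), pairs)
```

This mirrors the example in the bullets: name=concat(fname,lname) satisfies the overlap data while name=concat(fname,city) does not.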

Overlap Numeric Searcher
◦ Used instead of numeric searcher
◦ Uses equation discovery system to find the best
arithmetic expression that matches attribute t.
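To give a feel for what finding an arithmetic expression over overlap data involves, here is a brute-force stand-in that tries every binary expression over pairs of source attributes. It is not the equation discovery system iMAP relies on; the function name, the restriction to two attributes and the assumption that each overlap row already joins source values with the target value are all made up for the example.

```python
import itertools
import operator
from typing import Dict, List, Optional, Sequence, Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def best_arithmetic_match(
    overlap_rows: List[Dict[str, float]],
    target_attr: str,
    source_attrs: Sequence[str],
    tol: float = 1e-9,
) -> Tuple[float, Optional[str]]:
    """Return (fraction of overlap rows satisfied, expression) for the best
    expression of the form target = a <op> b, or (0.0, None) if none fits."""
    best: Tuple[float, Optional[str]] = (0.0, None)
    if not overlap_rows:
        return best
    for a, b in itertools.permutations(source_attrs, 2):
        for symbol, fn in OPS.items():
            hits = 0
            for row in overlap_rows:
                try:
                    value = fn(row[a], row[b])
                except ZeroDivisionError:
                    continue  # expression undefined for this row
                if abs(value - row[target_attr]) <= tol:
                    hits += 1
            fraction = hits / len(overlap_rows)
            if fraction > best[0]:
                best = (fraction, f"{target_attr}={a}{symbol}{b}")
    return best


# Made-up overlap rows: source values joined with the matching target value.
joined = [
    {"qty": 2, "unitprice": 5.0, "total": 10.0},
    {"qty": 3, "unitprice": 4.0, "total": 12.0},
]
result = best_arithmetic_match(joined, "total", ["qty", "unitprice"])
```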

Overlap Category and Schema Mismatch Searchers
◦ Similar technique to Overlap text searcher




Explanations help users better understand the
system.
The system uses complex processes and hence
there is a need to explain decisions to the user.
This helps the user guide the system to find the
correct matches
Type of Questions a user can ask:
◦ Explain a match – why was numrooms=baths+beds+floor
generated?
◦ Explain why a match is not present – why was
price=listprice+agent_fee not generated?
◦ Explain the ranking between matches – why is price=listprice
ranked better than price=listprice+agent_fee?

Questions can be asked to any specific module



Uses a dependency graph that is created as
the process flows through each module
Records matches, assumptions, data
Nodes: attributes, assumptions, candidate
matches, domain knowledge. Nodes are connected by
directed edges labeled with the module that
was responsible for the decision.
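A dependency graph of this kind could be represented along the following lines. The class, its methods, the node labels and the made-up constraint are hypothetical; the sketch only shows how edges labeled with the responsible module let an explanation be produced by walking backwards from a match. It assumes the graph has no cycles.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class DependencyGraph:
    """Directed graph whose edges carry the name of the responsible module."""

    def __init__(self) -> None:
        # node -> list of (node it depends on, module that made the decision)
        self._incoming: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_edge(self, src: str, dst: str, module: str) -> None:
        self._incoming[dst].append((src, module))

    def explain(self, node: str, depth: int = 0) -> List[str]:
        """List everything node depends on, indented by distance."""
        lines: List[str] = []
        for src, module in self._incoming.get(node, []):
            lines.append(f"{'  ' * depth}{node} <- {src} [{module}]")
            lines.extend(self.explain(src, depth + 1))
        return lines


graph = DependencyGraph()
graph.add_edge("attribute listprice", "candidate price=listprice", "match generator")
graph.add_edge(
    "constraint: price and agent_fee unrelated",  # made-up domain knowledge
    "candidate price=listprice",
    "match selector",
)
explanation = graph.explain("candidate price=listprice")
```

Each line of `explanation` names one input to the decision and the module that used it, which is the information a user's "why" question needs.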

Domains and Data
◦ Cricket, Inventory, Financial, Real estate
◦ Collected from various sources
◦ Use both schemas with overlap data as well as disjoint
schemas

Performance Measure
◦ Top 1 matching accuracy - % of target attributes for which the best
match is the correct match
◦ Top 3 matching accuracy - % of target attributes for which the top 3
ranked matches contain the correct
match – more relevant, as the tool is used only to
suggest a ranked list of matches.
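The two measures can be computed from ranked candidate lists as in this small sketch. The data structures (a dict of ranked lists per target attribute and a dict of correct matches) and the sample values are assumptions for illustration.

```python
from typing import Dict, List


def top_k_accuracy(
    ranked: Dict[str, List[str]], correct: Dict[str, str], k: int
) -> float:
    """Fraction of target attributes whose correct match is in the top k."""
    if not correct:
        return 0.0
    hits = sum(
        1 for attr, truth in correct.items() if truth in ranked.get(attr, [])[:k]
    )
    return hits / len(correct)


# Made-up system output and gold standard.
ranked_matches = {
    "name": ["concat(fname,lname)", "fname"],
    "price": ["listprice+agent_fee", "listprice"],
}
gold = {"name": "concat(fname,lname)", "price": "listprice"}
top1 = top_k_accuracy(ranked_matches, gold, k=1)
top3 = top_k_accuracy(ranked_matches, gold, k=3)
```

In this toy case the correct match for `price` is ranked second, so it counts toward top-3 accuracy but not top-1.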

Overlap Data
◦ Default iMAP – 58-74% accuracy
◦ Exploiting domain knowledge – 68-92%

Disjoint Data
◦ Default iMAP – 55-76% accuracy
◦ Exploiting domain knowledge – 62-79%

Top-1 Accuracy – 62-92%
Top-3 Accuracy – 64-95%
Exploiting domain knowledge and presence of
overlap data increases accuracy

Default System – 33-55%
Exploiting domain constraint and overlap data – 50-86%

Top-1 Accuracy
◦ Disjoint data – 27-58%
◦ Overlap data – 70-85%

Top-3 Accuracy
◦ Disjoint data – 43-92%
◦ Overlap data – 43-92%

The gain from exploiting domain constraints and overlap data is
significant
High accuracy in Top-3 matches for complex
matches


Top - 3 accuracy is better than Top – 1
Removing small attributes that act as noise in
a complex match is difficult
 Phone=concat(id,areacode,number), id is a small value
that acts as noise

Difficult to find small attributes that form
parts of a complex match
 Address=concat(street,city,state) , apt# is a small
value that is missed

L. Xu and D. Embley. Using domain ontologies to discover
direct and indirect matches for schema elements. In Proc.of
the Semantic Integration Workshop at ISWC-2003.
 Uses domain ontology to find relationships and mappings between two schemas
 Can be useful in some contexts such as the date searcher.
 More such modules can be added

Clio – a complementary work that uses schema mappings to provide
an interactive system that generates transformations from one
schema to another.






Finding complex matches is a tough task
iMAP identifies both 1-1 and complex matches
Uses search modules to generate matches
Exploits domain knowledge and overlap data to
improve accuracy
Provides an explanation feature to better assist the
user
The accuracy levels obtained from
experimentation suggest that, with further fine
tuning and improvement, it can be even better



Handle types of data sources other than
relational data sources
Provide better user interaction style
Integrate more ontology based searchers









How is overlap data identified?
How are the domain constraints taken as input?
How does the user interact with the system?
How can the accuracy be improved?
Can the system learn from past experiences like user
suggestions (not just past matches)?
Is it scalable? Can it handle multiple schema matching (like
DI systems where a number of schemas are matched with a
mediated schema)?
What other search modules do you think need to be
integrated?
How efficient is it when handling complex schemas (the
schemas used in the experiments are rather small)?
Can it be extended to find complex matches over multiple
tables more accurately?