Step 3: Matching Algorithm (Matcher)

HKU CSIS DB Seminar:
COMA: A system for flexible combination of schema matching approaches
Hong-Hai Do and Erhard Rahm, VLDB 2002
Speaker: Eric Lo
http://www.csis.hku.hk/~dbgroup/seminar/seminar020927.htm
What is Schema Matching?
• Finding semantic correspondences between elements of
two schemas
• Input: 2 schemas
• Output: A set of mappings
DB Seminar
Why Schema Matching?
• Done by domain experts
• Time consuming
• Reduce user effort
• Semi-automatic
– Need user to verify
– Need user to modify
Application domains
• E-commerce:
– E.g. a comparison shopping website
– Aggregates product offers from multiple independent online stores
– Match each product catalog against their combined catalog
• [Amazon].product_code <-> [Combined].product_id
• [Wrox].bookid <-> [Combined].product_id
Application domains
• Data warehouses and data integration systems
– Preprocessing
• Data translation
– XML <-> relational data mapping
Schema matching categories
• Goal: high match accuracy for a large variety of schemas
• A single technique is not enough for different schemas -> combine different approaches effectively
• Hybrid approach:
– Most common
– Different match criteria (e.g. name, data type, dictionary, thesaurus…) are used in a single algorithm
• Composite approach:
– High flexibility
– One match algorithm per match criterion
– Combine the independent results of the algorithms
Outline
• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References
COMA: COmbining MAtch algorithms
• Composite approach
• No previous work on composite generic matching
• A generic match system
• Supports multiple schema types (e.g. XML and relational)
COMA
• Different match algorithms exist as an extensible library in COMA -> matchers
• Supports different ways of combining the results of the library's match algorithms
• An evaluation platform to systematically examine and compare the effectiveness of different matchers (the match algorithms in the library) and combination strategies
COMA
• Interactive and iterative match process that allows user feedback
• Also proposes a new matcher that reuses previously obtained match results (the authors observed that many schemas to be matched are very similar to previously matched schemas)
Matching Process
Matcher library:
– Simple matchers: n-gram, synonym
– Hybrid matchers: NamePath
[Figure: Schema1 and Schema2 are passed to Matcher1, Matcher2, Matcher3 and UserFeedback; the matcher results are stored in the similarity cube and then combined into the match result. Figure labels: S1 -> S2, S2 -> S1.]
Combined match result:
<schema1.cname> <-> <schema2.companyname>, Sim = 0.95
<schema1.cname> <-> <schema2.businessname>, Sim = 0.8
<schema1.address> <-> <schema2.address>, Sim = 1
5 Steps
• Step 1: Schema Representation
• Step 2: Schema Tree -> Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
Step1: Schema Representation
XML Schema Representation
Step 2
• Traverse the schema tree
• Represent each schema element by its path
– Sequence of nodes from the root
– E.g. Address in PO2
– Multiple paths
• PO2.DeliverTo.Address
• PO2.BillTo.Address
Step 3: Match algorithms
• Take each schema element path as input
• Return a similarity value
• If human feedback is involved:
– Similarity is 1 if the user approved the match (0 if rejected)
• The different matchers return similarity values between 0 and 1
• COMA currently supports simple, hybrid and reuse-oriented matchers (discussed later)
Storing the results of k matchers in a similarity cube
• k matchers
• m schema 1 elements
• n schema 2 elements
• A cube of k x m x n similarity values is stored in the repository for the later combination and selection steps
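Some samples from the similarity cube
Matcher               PO1 element               PO2 element             Sim
Matcher1 (TypeName)   ShipTo.shipToCity         DeliverTo.Address.City  0.65
Matcher1 (TypeName)   ShipTo.shipToStreet       DeliverTo.Address.City  0.3
Matcher1 (TypeName)   ShipTo.Customer.custCity  DeliverTo.Address.City  0.8
Matcher2 (NamePath)   ShipTo.shipToCity         DeliverTo.Address.City  0.78
Matcher2 (NamePath)   ShipTo.shipToStreet       DeliverTo.Address.City  0.73
Matcher2 (NamePath)   ShipTo.Customer.custCity  DeliverTo.Address.City  0.53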
Step 4 and 5: Combine match results
• Combine the k results from the similarity cube
• Step 4: Aggregation
– Aggregation of matcher-specific results
• E.g. taking the average / max / min of the k values
PO1 element               PO2 element             Aggregated sim
ShipTo.shipToCity         DeliverTo.Address.City  0.72
ShipTo.shipToStreet       DeliverTo.Address.City  0.52
ShipTo.Customer.custCity  DeliverTo.Address.City  0.67
• Step 5: Selection
– Selection of match candidates
• Select ShipTo.shipToCity <-> DeliverTo.Address.City (0.72)
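The numbers in Steps 4 and 5 can be reproduced from the similarity cube samples. The sketch below, in Python, assumes the cube is held as a nested dictionary (an illustrative layout, not COMA's repository format) and uses the average as the aggregation function. The unrounded averages are 0.715, 0.515 and 0.665, which the slide shows as 0.72, 0.52 and 0.67.

```python
# Similarity cube samples from the slides: matcher -> PO1 element -> similarity
# to the PO2 element DeliverTo.Address.City.
cube = {
    "TypeName": {
        "ShipTo.shipToCity": 0.65,
        "ShipTo.shipToStreet": 0.3,
        "ShipTo.Customer.custCity": 0.8,
    },
    "NamePath": {
        "ShipTo.shipToCity": 0.78,
        "ShipTo.shipToStreet": 0.73,
        "ShipTo.Customer.custCity": 0.53,
    },
}

# Step 4: aggregate the k = 2 matcher values per element pair by averaging.
elements = list(cube["TypeName"])
aggregated = {
    e: sum(matcher[e] for matcher in cube.values()) / len(cube) for e in elements
}

# Step 5: select the candidate with the highest aggregated similarity.
best = max(aggregated, key=aggregated.get)

for element, sim in aggregated.items():
    print(f"{element}: {sim:.3f}")
print("selected:", best)
```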
How do the matchers work?
• Step 1: Schema Representation
• Step 2: Schema Tree -> Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
COMA Matcher Library
Type            Name          Schema info         Aux. info
Simple          Affix         Element names       -
Simple          N-gram        Element names       -
Simple          Soundex       Element names       -
Simple          EditDistance  Element names       -
Simple          Synonym       Element names       External dictionaries
Simple          DataType      Data types          Data type compatibility table
Simple          UserFeedback  -                   User-specified (mis-)matches
Hybrid          NameMatcher   Element names       -
Hybrid          NamePath      Names + Paths       -
Hybrid          TypeName      Data types + Path   -
Hybrid          Children      Child elements      -
Hybrid          Leaves        Leaf elements       -
Reuse-oriented  Schema        -                   Existing schema-level match results
Simple Matcher
• Use the element name for comparison
– Name string
– Name semantics
• Can use approximate string matching techniques (as applied in data cleansing)
• Affix: looks for common affixes (prefixes and suffixes) in the name strings
• DataType: similarity = degree of compatibility of the 2 data types (values are predefined)
– E.g. int and bit = 0.6, text and hex = 0.1
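As a rough illustration of these two simple matchers, the Python sketch below uses a lookup table for DataType and a prefix/suffix check for Affix. Only the two compatibility values come from the slide. The symmetric table, the score of 1.0 for identical types, the default of 0.0 for unknown pairs and the all-or-nothing Affix score are assumptions made for the example.

```python
# Hypothetical compatibility table; only the two values below come from the
# slide, and treating the table as symmetric is an assumption.
TYPE_COMPATIBILITY = {
    frozenset(["int", "bit"]): 0.6,
    frozenset(["text", "hex"]): 0.1,
}


def datatype_similarity(type_a, type_b):
    """Look up the predefined compatibility of two data types."""
    if type_a == type_b:
        return 1.0  # assumption: identical types are fully compatible
    # Assumption: pairs missing from the table are treated as incompatible.
    return TYPE_COMPATIBILITY.get(frozenset([type_a, type_b]), 0.0)


def affix_similarity(name_a, name_b):
    """Return 1.0 if one name is a prefix or suffix of the other, else 0.0."""
    short, long_ = sorted((name_a.lower(), name_b.lower()), key=len)
    if not short:
        return 0.0
    return 1.0 if long_.startswith(short) or long_.endswith(short) else 0.0


print(datatype_similarity("int", "bit"))
print(affix_similarity("shipToStreet", "Street"))
```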
Hybrid Matcher
• Fixed combination of simple matchers
• E.g. EditDistance + DataType
• Hybrid Matcher 1 (Name Matcher):
– Tokenization (POShipTo -> PO, Ship, To)
– Expansion (PO -> Purchase, Order)
– Then use e.g. Affix + Trigram
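The tokenization and expansion steps can be sketched in Python as follows. The abbreviation table, the regular expression and the way token similarities are combined (average of each token's best trigram match) are assumptions for illustration. The slide only states that simple matchers such as Affix and Trigram are applied after tokenization and expansion.

```python
import re

# Hypothetical abbreviation table used for the expansion step.
ABBREVIATIONS = {"po": ["purchase", "order"]}


def tokenize(name):
    """Split a name such as 'POShipTo' into ['PO', 'Ship', 'To']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def expand(tokens):
    """Lower-case tokens and replace known abbreviations by their long form."""
    expanded = []
    for token in tokens:
        expanded.extend(ABBREVIATIONS.get(token.lower(), [token.lower()]))
    return expanded


def trigrams(token):
    """Set of 3-character substrings; short tokens are kept whole."""
    return {token[i:i + 3] for i in range(len(token) - 2)} or {token}


def trigram_similarity(a, b):
    """Dice coefficient over the trigram sets of two tokens."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def name_similarity(name_a, name_b):
    """Average, over the tokens of both names, of each token's best match."""
    tokens_a, tokens_b = expand(tokenize(name_a)), expand(tokenize(name_b))
    if not tokens_a or not tokens_b:
        return 0.0
    best_a = [max(trigram_similarity(a, b) for b in tokens_b) for a in tokens_a]
    best_b = [max(trigram_similarity(a, b) for a in tokens_a) for b in tokens_b]
    return (sum(best_a) + sum(best_b)) / (len(tokens_a) + len(tokens_b))


print(tokenize("POShipTo"))
print(expand(tokenize("POShipTo")))
print(name_similarity("POShipTo", "PurchaseOrderShipTo"))
```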
Another Hybrid Matcher
• NamePath Matcher:
– Name + Path (element + structure)
– Build a long string from the path
– Apply the Name Matcher
– E.g. PurchaseOrder.ShipTo.Street and PurchaseOrder.shipToStreet
– Same in Name Matcher, but not in NamePath
Outline
• Introduction
• COMA system
• Overview of different matchers [Step 3]
• Reuse matcher for COMA [Step 3]
• Evaluation
• Conclusions
• Discussions
• References
Reuse of previous match result
• Based on the authors' observation:
– Many schemas to be matched are similar (or identical) to previously matched schemas
– Build a reuse-oriented matcher to save resources
– A was matched with B before (A <-> B) [Match 101]
– B was matched with C before (B <-> C) [Match 234]
– Now a new match task: A <-> C
• The MatchCompose operation combines previous match results to obtain a new match result
MatchCompose operation
• Given 2 match results:
– match1: S1 <-> S2
– match2: S2 <-> S3
• MatchCompose derives a new match result S1 <-> S3
[Figure: MatchCompose mapping over PO1.Contact <-> PO2.Contact <-> PO3.Contact, yielding Match: S1 <-> S3. Element labels in the figure: Name, name, lastName, Email, email, firstName, Company, email, company.]
MatchCompose in relation
Match1:
PO1    PO2     SIM12
Name   name    1.0
Email  e-mail  1.0
Match2:
PO2     PO3        SIM23
name    lastName   0.6
name    firstName  0.6
e-mail  email      1.0
MatchCompose:
PO1    PO3        SIM13
Name   lastName   0.8
Name   firstName  0.8
Email  email      1.0
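The relational view of MatchCompose is essentially a join of the two match results on the shared PO2 element. The Python sketch below reproduces the SIM13 column under two assumptions that are inferred rather than stated on the slide: the composed similarity is the average of the two input similarities (0.8 is the average of 1.0 and 0.6), and if several PO2 elements link the same pair, the highest composed value is kept.

```python
# Match results from the slide as (source, target) -> similarity.
match1 = {("Name", "name"): 1.0, ("Email", "e-mail"): 1.0}  # PO1 <-> PO2
match2 = {  # PO2 <-> PO3
    ("name", "lastName"): 0.6,
    ("name", "firstName"): 0.6,
    ("e-mail", "email"): 1.0,
}


def match_compose(first, second):
    """Join two match results on their shared middle element.

    Assumption: the composed similarity is the average of both inputs, and
    the maximum is kept when several middle elements link the same pair.
    """
    composed = {}
    for (a, b), sim_ab in first.items():
        for (b2, c), sim_bc in second.items():
            if b == b2:
                sim = (sim_ab + sim_bc) / 2
                composed[(a, c)] = max(composed.get((a, c), 0.0), sim)
    return composed


match13 = match_compose(match1, match2)
for pair, sim in match13.items():
    print(pair, round(sim, 2))
```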
Re-use: Schema matcher
• All previous match results are stored in the repository
• A new match problem arrives, e.g. match S1 with S2
• Find all stored match results that relate a schema (Si, Sj, Sk, …) to BOTH S1 and S2, in any order
• Each such pair of match results undergoes MatchCompose
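The repository lookup can be sketched in Python as follows. The repository contents and the schema names Sa, Sb and Sc are made up for the example, and the repository is reduced to the set of schema pairs that have a stored match result. Each intermediate schema found this way gives one pair of match results to pass to MatchCompose.

```python
# Hypothetical repository: the set of schema pairs with a stored match result.
repository = {("S1", "Sa"), ("Sa", "S2"), ("S2", "Sb"), ("Sb", "S1"), ("S1", "Sc")}


def matched_with(schema, repo):
    """Schemas that have a stored match result with `schema`, in any order."""
    partners = set()
    for left, right in repo:
        if left == schema:
            partners.add(right)
        elif right == schema:
            partners.add(left)
    return partners


def reuse_candidates(s1, s2, repo):
    """Intermediate schemas Si that were matched with BOTH s1 and s2."""
    return sorted((matched_with(s1, repo) & matched_with(s2, repo)) - {s1, s2})


print(reuse_candidates("S1", "S2", repository))
```

Sc is not a candidate here because it was only matched with S1, not with S2.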
How to aggregate the results
from k matchers?
• Step 1: Schema Representation
• Step 2: Schema Tree -> Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
How to combine similarity values from different matchers?
• Aggregate the values of the different matchers into a single similarity value
• Max: return the maximum value of the k matchers
• Weighted sum: weights are assigned according to the expected importance of the matchers
• Average
• Min
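The four aggregation strategies can be written as one small Python function. The function name, the strategy labels and the example weights are illustrative. The two input values are the TypeName and NamePath similarities for ShipTo.shipToCity from the similarity cube samples.

```python
def aggregate(values, strategy="average", weights=None):
    """Combine the k matcher similarities of one element pair into one value."""
    if strategy == "max":
        return max(values)
    if strategy == "min":
        return min(values)
    if strategy == "average":
        return sum(values) / len(values)
    if strategy == "weighted":
        # Weights are assumed to be non-negative and to sum to 1.
        if weights is None or len(weights) != len(values):
            raise ValueError("weighted aggregation needs one weight per matcher")
        return sum(w * v for w, v in zip(weights, values))
    raise ValueError(f"unknown strategy: {strategy}")


# TypeName and NamePath values for ShipTo.shipToCity.
sims = [0.65, 0.78]
for name in ("max", "min", "average"):
    print(name, aggregate(sims, name))
print("weighted", aggregate(sims, "weighted", weights=[0.3, 0.7]))
```

Max is the most optimistic choice and Min the most pessimistic, while Average and Weighted sum let matchers compensate for each other.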
Among so many combinations, how do we select the set of results returned to the user?
• Step 1: Schema Representation
• Step 2: Schema Tree -> Distinct Elements
• Step 3: Matching Algorithm (Matcher)
• Step 4: Aggregation of k matcher values
• Step 5: Selection
Select candidates from the combined cube
• Direction of match candidate selection
• Given 2 schemas S1 and S2 with |S2| <= |S1|
• 3 directions: LargeSmall, SmallLarge, Both
• LargeSmall: match the large schema S1 against the small target S2, i.e. elements from S1 are ranked and selected with respect to each S2 element
3 directions
Similarity values (rows: large schema elements, columns: small schema elements):
              DeliverToAddress  BillToAddress
shipToCity    0.72              0.71
custCity      0.67              0.68
shipToStreet  0.52              0.6
LargeSmall (for each small schema element):
– DeliverToAddress: choose shipToCity
– BillToAddress: choose shipToCity
SmallLarge (for each large schema element):
– shipToCity: choose DeliverToAddress
– custCity: choose BillToAddress
– shipToStreet: choose BillToAddress
Both (LargeSmall + SmallLarge):
– shipToCity <-> DeliverToAddress: YES
– custCity <-> BillToAddress: NO
– shipToStreet <-> BillToAddress: NO
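The three directions can be checked against the numbers on this slide with a short Python sketch. It assumes that exactly one candidate (the maximum) is picked per element, and the function names are illustrative. Pairs are written as (large schema element, small schema element).

```python
# Aggregated similarities from the slide: large-schema element ->
# small-schema element -> similarity.
SIM = {
    "shipToCity": {"DeliverToAddress": 0.72, "BillToAddress": 0.71},
    "custCity": {"DeliverToAddress": 0.67, "BillToAddress": 0.68},
    "shipToStreet": {"DeliverToAddress": 0.52, "BillToAddress": 0.6},
}
SMALL = ["DeliverToAddress", "BillToAddress"]


def large_small(sim):
    """For each small-schema element, pick the best large-schema element."""
    return {(max(sim, key=lambda large: sim[large][s]), s) for s in SMALL}


def small_large(sim):
    """For each large-schema element, pick the best small-schema element."""
    return {(large, max(row, key=row.get)) for large, row in sim.items()}


def both(sim):
    """Keep only the pairs selected in both directions."""
    return large_small(sim) & small_large(sim)


print("LargeSmall:", sorted(large_small(SIM)))
print("SmallLarge:", sorted(small_large(SIM)))
print("Both:", sorted(both(SIM)))
```

Only shipToCity <-> DeliverToAddress is chosen in both directions, which matches the single YES on the slide.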
Selecting candidates (cont)
• Along one direction, 3 ways to select:
– MaxN: select the n candidates with the top sim. values
• If n = 1, 1-to-1 correspondence
– MaxDelta: select the candidate with the maximal sim. value Max and, given a tolerance value d, also those candidates with sim. value > Max - d
• Selects those that are almost maximal
– Threshold: all candidates with sim. value > threshold t
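A minimal Python sketch of the three selection strategies, applied to the candidates for DeliverToAddress from the "3 directions" slide. The function names and the parameter values d = 0.1 and t = 0.6 are chosen for illustration only.

```python
def max_n(candidates, n):
    """Select the n candidates with the highest similarity."""
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def max_delta(candidates, d):
    """Select the best candidate plus all candidates within tolerance d of it."""
    if not candidates:
        return []
    top = max(candidates.values())
    return [name for name, sim in candidates.items() if sim == top or sim > top - d]


def threshold(candidates, t):
    """Select all candidates with similarity above t."""
    return [name for name, sim in candidates.items() if sim > t]


# Candidates for DeliverToAddress, taken from the "3 directions" slide.
candidates = {"shipToCity": 0.72, "custCity": 0.67, "shipToStreet": 0.52}

print(max_n(candidates, 1))
print(max_delta(candidates, 0.1))
print(threshold(candidates, 0.6))
```

MaxN with n = 1 always returns exactly one candidate, whereas MaxDelta and Threshold can return several candidates when the similarity values are close together.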
Evaluation
• Tested on 5 real-world purchase order schemas
– CIDX, Excel, Noris, Paragon and Apertum (from www.biztalk.org)
– |Inner or leaf nodes| != |paths| -> the schemas share some fragments
Data Sets
• 5 schemas, 10 match tasks
• Done manually by domain experts
• #Matches = number of correspondences to be identified
• Shows the problem sizes
• Schema similarity = #MatchedPaths / #AllPaths
Evaluation – match quality
• Automatic match returns the set P of predicted matches
• I is the set of real matches (identified by domain experts)
• c = P ∩ I is the set of correctly identified matches (true positives)
• Precision = |c| / |P| -> reliability of the match predictions
• Recall = |c| / |I| -> % of real matches found
• Accuracy = Recall * (2 - 1/Precision)
• Accuracy reflects the labour saved: it accounts for the effort needed to correct the incorrect matches and to identify the missed matches
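The three measures can be computed from sets of match pairs, as in the Python sketch below. The example pairs are made up. Accuracy is computed as 1 - (false positives + missed matches) / |I|, which is algebraically equal to Recall * (2 - 1/Precision), since both reduce to (2|c| - |P|) / |I|. This form also shows why Accuracy becomes negative once Precision drops below 0.5.

```python
def match_quality(predicted, real):
    """Return (precision, recall, accuracy) for non-empty sets of matches."""
    correct = predicted & real  # c: correctly identified matches
    precision = len(correct) / len(predicted)
    recall = len(correct) / len(real)
    # Equal to recall * (2 - 1 / precision), but also defined for precision 0.
    false_positives = len(predicted) - len(correct)
    missed = len(real) - len(correct)
    accuracy = 1 - (false_positives + missed) / len(real)
    return precision, recall, accuracy


# Made-up example: 4 real matches, 5 predicted matches, 2 of them correct.
real = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
predicted = {("a", "x"), ("b", "y"), ("c", "q"), ("d", "r"), ("e", "s")}

precision, recall, accuracy = match_quality(predicted, real)
print(precision, recall, accuracy)
```

In this example the user has to remove 3 wrong matches and add 2 missing ones, which is more work than identifying the 4 real matches by hand, so Accuracy is below zero.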
Experimental result
• Only in automatic mode
• Conducted 12,312 series of experiments
– Different choices of matchers
– Different choices of direction etc.
• Each combination is run on the 10 schema matching tasks (1 <-> 2, …)
Distribution of no-reuse matchers
[Figure: distribution of the no-reuse matcher series over Accuracy]
• 1 series = 1 combination
• Most (7077) no-reuse matcher combinations have Accuracy < 0
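Distribution w.r.t. aggregation
[Figure: distribution over Accuracy by aggregation strategy]
Distribution w.r.t. direction
[Figure: distribution over Accuracy by direction]
Distribution w.r.t. selection
[Figure: distribution over Accuracy by selection strategy]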
Outline
• Introduction
• COMA system
• Overview of different matchers
• Reuse matcher from COMA
• Evaluation
• Conclusions
• Discussions
• References
Conclusions
• COMA provides a framework for combining different matchers for different purposes
• A new matcher: the reuse-oriented matcher
Discussions
• Most are 1:1 matchings; what about n:1 and n:m?
[Figure: match cardinality examples over elements a, b and c, labelled (1:1) local, (1:1) local, (2:1) local and (2:1) global]
• Accuracy metric
• Is time a problem?
• To match 2 schemas, is A <-> B a must?
– What if A maps to B to some extent, and B maps to A to another extent?
References
• [VLDB02] COMA: A system for flexible combination of schema matching approaches
– By Hong-Hai Do, Erhard Rahm
– University of Leipzig
• [ICDE02] Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching
– By Sergey Melnik, Hector Garcia-Molina, Erhard Rahm
– Stanford and University of Leipzig
• [VLDB02] Translating Web Data
– By Lucian Popa, Yannis Velegrakis, Renee J. Miller, et al.
– IBM Almaden Research Center and University of Toronto
eNd
Interactive mode
• In contrast with automatic mode
• The user interacts with COMA in each iteration (optional)
• E.g.
– Specify which matchers to use (simple / hybrid)
– Accept / reject match candidates
• Improves match quality
Simple Matcher
• EditDistance: similarity is derived from the number of edit operations needed to transform one string into the other
• Synonym: looks up the terminological relationship in a specified dictionary
• N-gram: compares names by their n-grams, i.e. sequences of n characters
• Soundex: based on phonetic similarity
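As an illustration of the EditDistance matcher, the Python sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and maps it to a similarity between 0 and 1. Dividing by the length of the longer string is an assumption for the example. The slide does not say how COMA normalises the distance.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map the distance to [0, 1]; this normalisation is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(edit_distance("email", "e-mail"))
print(edit_similarity("email", "e-mail"))
```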
Hybrid Matcher
• TypeName Matcher:
– DataType + Name Matcher
• Children Matcher:
– Leaves are compared with the TypeName Matcher
– When comparing two non-leaf elements A and B, compare A’s children with B’s children
Hybrid Matcher
• Leaves Matcher:
– Similar to the Children Matcher, but only considers the leaves, compared with the TypeName Matcher
– PO1.ShipTo.shipToStreet
– PO1.ShipTo.shipToCity
– PO2.DeliverTo.Address.Street
– PO2.DeliverTo.Address.City
– Comparing ShipTo with DeliverTo using the Children Matcher would compare shipToStreet with Address!