Duplicates Detection in Database Integration

Outline
- Background
- HumMer – Automatic Data Fusion System
- Duplicate Detection methods
  - An efficient method using priority queue
  - Approach based on Extended key
  - Approach based on Classification methods
- More Resources

Background

What is Data Integration?
- Data Integration
  - It is the problem of taking multiple independently developed databases and resolving the differences between them, to make them appear as one.

What is Entity Matching/Identification?
- Entity Matching/Identification
  - Given a pair of records drawn from semantically corresponding tables in multiple heterogeneous databases, the objective is to determine whether the records represent the same real-world entity.
  - It is closely related to Duplicate Detection.

Duplicate Detection
- Schema level
  - Identify the duplicate attributes
- Instance level
  - Identify the duplicate records

HumMer – Automatic Data Fusion System

Introduction
- It is a system trying to fuse heterogeneous, duplicate, and conflicting data. [1]
- It incorporates
  - Schema Matching
  - Duplicate (Records) Detection
  - Conflict Resolution

System Diagram (figure omitted)

Components
- Query Language
  - Based on SQL
  - For Example (not reproduced)

Components
- Conflict Resolution
  - SQL function (min, max, sum, …)
  - Others
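
As a rough illustration of conflict resolution with aggregate-style functions, the sketch below fuses a group of duplicate tuples into one tuple by applying one resolution function per attribute. The helper `fuse`, the sample tuples, and the choice of functions are all assumptions made for illustration; they are not HumMer's actual interface.

```python
def fuse(group, resolvers):
    """Fuse duplicate tuples into one, resolving each attribute with its function."""
    fused = {}
    for attr, resolve in resolvers.items():
        # Ignore missing values so they never win a conflict.
        values = [t[attr] for t in group if t.get(attr) is not None]
        fused[attr] = resolve(values) if values else None
    return fused


# Hypothetical duplicates describing the same person.
duplicates = [
    {"name": "J. Smith", "age": 41, "salary": None},
    {"name": "John Smith", "age": 42, "salary": 50000},
]

# "max" mirrors the SQL aggregate; picking the longest name is one of the "others".
fused = fuse(duplicates, {
    "name": lambda values: max(values, key=len),
    "age": max,
    "salary": max,
})
print(fused)
```

Passing the functions in a dictionary keeps the resolution policy separate from the fusion loop, so a different policy per attribute needs no code change.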

Components
- Schematic Heterogeneity Resolving
  - Schema Matching
    - Detect a few duplicates in two unaligned databases
      - Convert tuples to strings
      - Do string matching to find duplicates
    - Derive attribute correspondences based on similar attribute values of duplicates
      - Two duplicates are compared field-wise, resulting in a matrix containing similarity scores for each attribute combination
      - A threshold is used to get the final attribute correspondences
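
The field-wise comparison of two duplicates can be sketched as follows. This is a minimal illustration, assuming string similarity from Python's `difflib` as the similarity measure and a threshold of 0.8; the function names and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Similarity in [0, 1] between two attribute values, compared as strings."""
    return SequenceMatcher(None, str(a).casefold(), str(b).casefold()).ratio()


def similarity_matrix(tuple_a, tuple_b):
    """One similarity score for each attribute combination of the two tuples."""
    return [[field_similarity(a, b) for b in tuple_b] for a in tuple_a]


def attribute_correspondences(tuple_a, tuple_b, threshold=0.8):
    """Index pairs (i, j) whose similarity score reaches the threshold."""
    matrix = similarity_matrix(tuple_a, tuple_b)
    return [(i, j)
            for i, row in enumerate(matrix)
            for j, score in enumerate(row)
            if score >= threshold]


# Hypothetical duplicate found in two unaligned tables (columns in different order).
row_r = ("Ada Lovelace", "LONDON", "12345")
row_s = ("12345", "Ada Lovelace", "London")
print(attribute_correspondences(row_r, row_s))
```

Each returned pair says that column `i` of the first table corresponds to column `j` of the second one.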

Components
- Schematic Heterogeneity Resolving
  - Data Transformation
    - For the two relations to be fused
      - One schema is chosen to determine the names of the attribute correspondences in the fused table
      - All tables receive an additional "sourceID" attribute
      - The full outer union of all tables is computed.

Components
- Duplicate Records Detection
  - Specify the relevant attributes
  - Compare tuples pairwise using a similarity measure
  - Objects with similarity above a given threshold are considered duplicates
  - The closure of the duplicates is computed, in which each is assigned one <sourceID>
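
The duplicate record detection steps (pairwise comparison on the relevant attributes, a threshold, then the closure of the duplicate pairs) can be sketched as below. It is only an illustration under assumptions: the similarity measure is the average `difflib` string similarity, the threshold 0.85 is arbitrary, and the sample rows are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(t1, t2, attrs):
    """Average string similarity over the relevant attributes."""
    scores = [
        SequenceMatcher(None, str(t1[a]).casefold(), str(t2[a]).casefold()).ratio()
        for a in attrs
    ]
    return sum(scores) / len(scores)


def duplicate_groups(tuples, attrs, threshold=0.85):
    """Compare all pairs, then return one group id per tuple (transitive closure)."""
    parent = list(range(len(tuples)))

    def find(i):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(tuples)), 2):
        if similarity(tuples[i], tuples[j], attrs) > threshold:
            parent[find(i)] = find(j)  # merge the two groups
    return [find(i) for i in range(len(tuples))]


# Hypothetical rows from the full outer union of two sources.
rows = [
    {"name": "Jon Smith", "city": "Boston"},
    {"name": "John Smith", "city": "Boston"},
    {"name": "Mary Jones", "city": "Austin"},
]
groups = duplicate_groups(rows, ["name", "city"])
print(groups)
```

Tuples that share a group id are treated as the same real-world object. Comparing all pairs is quadratic in the number of tuples, which is the cost the methods in the next part try to reduce.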

Duplicate Detection Methods
- An efficient entity identification method using priority queue [2]
- Approach based on Extended key [3]
- Approach based on Classification methods [4]

An Efficient Entity Identification Method Using Priority Queue

Introduction
- Addressed Issue:
  - Entity Identification with improved efficiency.
- Standard method
  - Sort the table based on an application-specific key
  - Compare nearby records using a sliding window
- Cost: expensive pairwise record comparison
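
The standard method above can be sketched as follows: sort the records, then pair each record only with the few records just before it in sorted order. The function name, the window size, and the sample records are assumptions for illustration.

```python
def sliding_window_pairs(records, key, window=3):
    """Sort by key, then pair each record with the window-1 records before it."""
    ordered = sorted(records, key=key)
    for i, current in enumerate(ordered):
        for previous in ordered[max(0, i - window + 1):i]:
            yield previous, current


# Hypothetical records; the record string itself serves as the sort key.
records = ["smith john", "smyth john", "jones mary", "smith jon", "jones marie"]
pairs = list(sliding_window_pairs(records, key=lambda r: r, window=2))
for left, right in pairs:
    print(left, "|", right)
```

With five records and a window of 2, only four candidate pairs are produced instead of the ten pairs of an all-against-all comparison. Each candidate pair still needs a full record comparison.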

Main Component
- How to improve efficiency?
  - Each record is only compared with clusters of duplicates in the Priority Queue.
    - The record only needs to be compared with the representatives of a cluster
- Priority Queue
  - It contains clusters of duplicates.
    - Each cluster contains some records as representatives.
  - Every cluster has a priority.
    - The newly updated cluster has the highest priority
  - The queue has a fixed size
    - The cluster with least priority will be replaced by the newly-added cluster.

Main Component
- Strategy of Priority Queue
  - When there is a new record, its membership with clusters in the queue is checked.
    - If it belongs to none, create a new cluster for it, and put it in the queue.
      - If the queue is full, the cluster with least priority will be moved out.
    - If it belongs to one cluster, the cluster will be given highest priority.
      - If the matching score of the record is below a certain threshold, set it as one representative.
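
The priority queue strategy can be sketched as below. This is a simplified illustration, not the algorithm of [2] as published. It assumes that the matching score is the average similarity to a cluster's representatives, and that the thresholds are 0.5 for membership and 0.8 for becoming a representative. It uses `difflib` string similarity and hypothetical, already sorted records.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for an edit-distance based score."""
    return SequenceMatcher(None, a, b).ratio()


def matching_score(record, cluster):
    """Average similarity between the record and the cluster's representatives."""
    reps = cluster["representatives"]
    return sum(similarity(record, rep) for rep in reps) / len(reps)


def detect_clusters(sorted_records, queue_size=2,
                    member_threshold=0.5, representative_threshold=0.8):
    queue = []     # index 0 holds the highest priority, the last index the lowest
    clusters = []  # every cluster in creation order, including evicted ones
    for record in sorted_records:
        for position, cluster in enumerate(queue):
            score = matching_score(record, cluster)
            if score > member_threshold:
                cluster["members"].append(record)
                if score < representative_threshold:
                    # A weak match adds new information, so keep it as a representative.
                    cluster["representatives"].append(record)
                queue.insert(0, queue.pop(position))  # give it the highest priority
                break
        else:
            # The record belongs to no cluster in the queue: start a new one.
            cluster = {"representatives": [record], "members": [record]}
            clusters.append(cluster)
            if len(queue) == queue_size:
                queue.pop()  # move out the cluster with the least priority
            queue.insert(0, cluster)
    return [cluster["members"] for cluster in clusters]


records = ["jones marie", "jones mary", "smith john", "smith jon", "smyth john"]
found = detect_clusters(records)
print(found)
```

Because a record is only scored against the representatives of at most `queue_size` clusters, the number of comparisons per record stays bounded regardless of table size.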

Other Component
- How to compare records?
  - Convert record to a long string
  - Use Edit Distance algorithm
    - Example: edit(Error,Eror) = 1, edit(great,grate) = 2
- How to store and query the duplicate clusters?
  - Union-Find data structure
    - Union: Combine or merge two sets into a single set.
    - Find: Determine which set a particular element is in. Also useful for determining if two elements are in the same set.

Workflow
1. Sort the records according to the given attributes
2. Scan through the database sequentially
   - Check each record
     - Compare it with the representatives of each cluster in priority queue.
     - Update the priority queue
3. Scan through the database again
   - Check each record (same as before)
4. End
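
The two building blocks named under Other Component, edit distance and Union-Find, can be sketched as below. The code assumes that "edit distance" means Levenshtein distance (insertions, deletions, and substitutions each cost 1). The class and function names are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance between s and t, computed row by row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # delete cs
                current[j - 1] + 1,            # insert ct
                previous[j - 1] + (cs != ct),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


class UnionFind:
    """Disjoint sets of record ids, used to store the duplicate clusters."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the representative of the set that contains x."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the sets that contain x and y."""
        self.parent[self.find(x)] = self.find(y)


print(edit_distance("Error", "Eror"), edit_distance("great", "grate"))

clusters = UnionFind()
clusters.union("r1", "r2")
clusters.union("r2", "r3")
print(clusters.find("r1") == clusters.find("r3"))
```

Two records are in the same duplicate cluster exactly when `find` returns the same representative for both.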

Approach based on Extended key

Introduction
- Addressed Issue:
  - How to determine the correspondence between records from multiple databases with different schemas?

Main Idea
- What is Extended Key?
  - A minimal set of attributes able to uniquely identify an instance
- New way to resolve the heterogeneity
  - Use Extended Key Equivalence
    - It is defined by ILFD (Instance level functional dependencies) tables

An Example (figure omitted)

Main Idea
- How to get ILFD tables?
  - Such semantic information can be supplied
    - By database administrators during Schema Integration
    - Through some knowledge acquisition tools
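
For the extended-key approach, the sketch below shows one way ILFDs could be used to match records across two schemas. It is a simplified reading, not the method of [3] as published: each ILFD is modelled as "attribute = value implies attribute = value", and two records match when they agree on every extended-key attribute that both of them have after the ILFDs are applied. The restaurant data and all names are hypothetical.

```python
def apply_ilfds(record, ilfds):
    """Return a copy of the record extended with every value the ILFDs imply."""
    extended = dict(record)
    for (if_attr, if_value), (then_attr, then_value) in ilfds:
        if extended.get(if_attr) == if_value:
            extended.setdefault(then_attr, then_value)
    return extended


def match_on_extended_key(table_r, table_s, extended_key, ilfds):
    """Pair records that agree on all extended-key attributes known to both."""
    matches = []
    for r in table_r:
        r_ext = apply_ilfds(r, ilfds)
        for s in table_s:
            s_ext = apply_ilfds(s, ilfds)
            shared = [a for a in extended_key if a in r_ext and a in s_ext]
            if shared and all(r_ext[a] == s_ext[a] for a in shared):
                matches.append((r, s))
    return matches


# Hypothetical tables with different schemas: R has "cuisine", S has "speciality".
table_r = [
    {"name": "Golden Wok", "cuisine": "Chinese"},
    {"name": "Golden Wok", "cuisine": "Thai"},
]
table_s = [{"name": "Golden Wok", "speciality": "Hunan"}]

# Hypothetical ILFD: (E.speciality = "Hunan") -> (E.cuisine = "Chinese")
ilfds = [(("speciality", "Hunan"), ("cuisine", "Chinese"))]

matches = match_on_extended_key(table_r, table_s, ("name", "cuisine"), ilfds)
print(matches)
```

The name alone cannot tell the two records of R apart. The ILFD supplies the missing `cuisine` value for the record of S, so only the Chinese restaurant matches.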

Approach based on Classification methods

Introduction
- Addressed Issue:
  - Entity Matching using machine learning method.
- Main Idea:
  - Model the problem as a binary (match or non-match) classification problem.
  - For each pair of records, the trained classifier will tell whether they are duplicates or not.

Strategy
- How to use one pair of records as input of classifier?
  - Convert it to a vector.
- Here is an example:
  - A vector with length of 16 (figure omitted)

Strategy
- How to get training examples?
  - One way is to get suggestions from domain expert.
  - Another way is to use partial common key, if available.
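
The classification strategy can be sketched end to end as below: each record pair becomes a vector of per-attribute similarities, and a classifier trained on labelled pairs decides match or non-match. A nearest-centroid classifier is used here purely as a small stand-in; it is not the constrained cascade generalization of [4]. The attributes, training pairs, and function names are hypothetical.

```python
from difflib import SequenceMatcher

ATTRS = ("name", "city")  # hypothetical shared attributes


def pair_to_vector(r1, r2, attrs=ATTRS):
    """Convert one pair of records to a vector: one similarity per attribute."""
    return [
        SequenceMatcher(None, str(r1[a]).casefold(), str(r2[a]).casefold()).ratio()
        for a in attrs
    ]


def train_centroids(vectors, labels):
    """Compute the mean vector of each class (nearest-centroid training)."""
    centroids = {}
    for label in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def classify(vector, centroids):
    """Return the label whose centroid is closest to the vector."""
    def squared_distance(centroid):
        return sum((x - y) ** 2 for x, y in zip(vector, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical training pairs, e.g. labelled by a domain expert.
training = [
    ({"name": "John Smith", "city": "Boston"}, {"name": "Jon Smith", "city": "Boston"}, "match"),
    ({"name": "Mary Jones", "city": "Austin"}, {"name": "Marie Jones", "city": "Austin"}, "match"),
    ({"name": "John Smith", "city": "Boston"}, {"name": "Mary Jones", "city": "Austin"}, "non-match"),
    ({"name": "Ann Lee", "city": "Denver"}, {"name": "Bob Ray", "city": "Miami"}, "non-match"),
]
model = train_centroids([pair_to_vector(a, b) for a, b, _ in training],
                        [label for _, _, label in training])

candidate = ({"name": "Anne Lee", "city": "Denver"}, {"name": "Ann Lee", "city": "Denver"})
print(classify(pair_to_vector(*candidate), model))
```

Any binary classifier that accepts numeric vectors could replace the centroid model without changing `pair_to_vector`.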

More Resources

Reading Resources
- Schema Matching
  - Rahm, Erhard and Bernstein, Philip A. (2001) A survey of approaches to automatic schema matching. VLDB Journal
- Duplicate Record Detection
  - A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios. Duplicate record detection: A survey. TKDE 19(1): 1-16, 2007

Paper Lists
[1] A. Bilke, J. Bleiholder, C. Bohm, K. Draba, F. Naumann, and M. Weis. Automatic data fusion with HumMer. In Proc. of VLDB, Trondheim, Norway, 2005
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
[3] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p.294-301, April 19-23, 1993
[4] Huimin Zhao, Sudha Ram, Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization. Data & Knowledge Engineering, Volume 66, 2008, 368-381

Question & Answer
Thank you!

Exercise 1.
Use Extended Key to do Entity Identification [1]

- Table R and S as shown below:

Table R
Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S
Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44

- Suppose the extended key is {name, city, homeaddress} and the following ILFDs:
  - (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
  - (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
  - (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
  - (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
- Please construct the integrated table.

----------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p.294-301, April 19-23, 1993

Exercise 2.
Use Priority Queue to do Duplicate Detection [2]

- Given the conditions below, please use the Priority Queue algorithm to find the duplicate clusters within.

1. Table R, which is already sorted according to the application-specific key:

   Tuple
   T1
   T2
   T3
   T4
   T5
   T6
   T7

2. Similarities between tuples:

        T1   T2   T3   T4   T5   T6   T7
   T1   -    ?    0.1  0.3  0.5  0.1  0.2
   T2   ?    -    0.2  0.4  0.4  0.4  0.2
   T3   0.1  0.2  -    0.9  0.4  0.6  0.5
   T4   0.3  0.4  0.9  -    0.4  0.6  0.6
   T5   0.5  0.4  0.4  0.4  -    0.4  0.8
   T6   0.1  0.4  0.6  0.6  0.4  -    ?
   T7   0.2  0.2  0.5  0.6  0.8  ?    -

   (The cells marked ? are the pairs T1-T2 and T6-T7. Their values are 0.6 and 0.4, but which value belongs to which pair cannot be recovered from the source.)

3. Method to count the Matching Score:
   Given one cluster, the Matching Score of one tuple is the average of the tuple's similarity with all of the cluster's representatives.
4. The condition to declare a new cluster: matching score > 0.5
5. The condition to declare a representative: 0.5 < matching score < 0.8
6. The size of the Priority Queue: 2

----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997