Duplicates Detection in Database Integration

Outline
• Background
• HumMer – Automatic Data Fusion System
• Duplicate Detection methods
   • An efficient method using priority queue
   • Approach based on Extended key
   • Approach based on Classification methods
• More Resources

Background
What is Data Integration?
• Data Integration
   • It is the problem of taking multiple independently developed databases and resolving the differences between them, to make them appear as one.
• Duplicate Detection
   • Schema level: identify the duplicate attributes
   • Instance level: identify the duplicate records

Background
What is Entity Matching/Identification?
• Entity Matching/Identification
   • Given a pair of records drawn from semantically corresponding tables in multiple heterogeneous databases, the objective is to determine whether the records represent the same real-world entity.
   • It is closely related to Duplicate Detection.

Introduction
HumMer – Automatic Data Fusion System
• It is a system trying to fuse heterogeneous, duplicate, and conflicting data.
[1]
• It incorporates
   • Schema Matching
   • Duplicate (Records) Detection
   • Conflict Resolution

System Diagram

Components
• Query Language
   • Based on SQL
   • For Example

Components
• Conflict Resolution
   • SQL functions (min, max, sum, …)
   • Others

Components
• Schematic Heterogeneity Resolving
   • Schema Matching
      • Detect a few duplicates in two unaligned databases
         • Convert tuples to strings
         • Do string matching to find duplicates
      • Derive attribute correspondences based on similar attribute values of duplicates
         • Two duplicates are compared field-wise, resulting in a matrix containing similarity scores for each attribute combination
         • Use a threshold to get the final attribute correspondences

Components
• Schematic Heterogeneity Resolving
   • Data Transformation
      • For the two relations to be fused
         • One schema is chosen to determine the names of attribute correspondences in the fused table
         • All tables receive an additional “sourceID”
         • The full outer union of all tables is computed.

Components
• Duplicate Records Detection
   • Specify the relevant attributes
   • Compare tuples pairwise using a similarity measure
   • Objects with similarity above a given threshold are considered duplicates
   • The closure of duplicates is computed, in which each is assigned one <sourceID>

Duplicates Detection Methods

Duplicate Detection Methods
• An efficient entity identification method using priority queue [2]
• Approach based on Extended key [3]
• Approach based on Classification methods [4]

Introduction
An Efficient Entity Identification Method Using Priority Queue
• Addressed Issue: Entity Identification with improved efficiency.
• Standard method
   • Sort the table based on an application-specific key
   • Compare nearby records using a sliding window
• Cost
   • Expensive pairwise record comparison

Main Component
How to improve efficiency?
• Main Component
   • Each record is only compared with clusters of duplicates in the Priority Queue.
   • The record only needs to be compared with the representatives of a cluster.
• Priority Queue
   • It contains clusters of duplicates.
   • Each cluster contains some records as representatives.
   • Every cluster has a priority.
   • The most recently updated cluster has the highest priority.
   • The queue has a fixed size.
   • The cluster with the least priority will be replaced by the newly added cluster.

Strategy of Priority Queue
• When there is a new record, its membership with clusters in the queue is checked.
   • If it belongs to one cluster, the cluster will be given the highest priority.
      • If the matching score of the record is below a certain threshold, set it as one representative.
   • If it belongs to none, create a new cluster for it, and put it in the queue.
      • If the queue is full, the cluster with the least priority will be moved out.

Other Component
• How to compare records?
   • Convert the record to a long string
   • Use the Edit Distance algorithm
   • Example: edit(Error,Eror) = 1, edit(great,grate) = 2
• How to store and query the duplicate clusters?
   • Union-Find data structure
      • Union: Combine or merge two sets into a single set.
      • Find: Determine which set a particular element is in. Also useful for determining if two elements are in the same set.

Workflow
1. Sort the records according to the given attributes
2. Scan through the database sequentially
   • Check each record
      • Compare it with the representatives of each cluster in the priority queue.
      • Update the priority queue
3. Scan through the database again
   • Check each record (same as before)
4. End

Introduction
Approach based on Extended key
• Addressed Issue: How to determine the correspondence between records from multiple databases with different schemas?

Main Idea
• What is Extended Key?
   • A minimal set of attributes able to uniquely identify an instance
• New way to resolve the heterogeneity
   • Use Extended Key Equivalence
   • It is defined by ILFD (instance-level functional dependency) tables

An Example

Main Idea
How to get ILFD tables?
• Such semantic information can be supplied
   • By database administrators during Schema Integration
   • Through some knowledge acquisition tools

Introduction
Approach based on Classification methods
• Addressed Issue: Entity Matching using a machine learning method.
• Main Idea: Model the problem as a binary (match or non-match) classification problem.
   • For each pair of records, the trained classifier will tell whether they are duplicates or not.

Strategy
• How to use one pair of records as input of the classifier?
   • Convert it to a vector.
   • Here is an example: a vector with length of 16

Strategy
• How to get training examples?
   • One way is to get suggestions from a domain expert.
   • Another way is to use a partial common key, if available.

More Resources
• Paper Lists
[1] A. Bilke, J. Bleiholder, C. Bohm, K. Draba, F. Naumann, and M. Weis. Automatic data fusion with HumMer. In Proc. of VLDB, Trondheim, Norway, 2005.
[2] A.E. Monge and C.P. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.
[3] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson. Entity Identification in Database Integration. Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993.
[4] Huimin Zhao, Sudha Ram. Entity matching across heterogeneous data sources: An approach based on constrained cascade generalization. Data & Knowledge Engineering, Volume 66, 2008, 368-381.

More Resources
• Reading Resources
   • Schema Matching
      • Rahm, Erhard and Bernstein, Philip A. (2001) A survey of approaches to automatic schema matching. VLDB Journal
   • Duplicate Record Detection
      • A. K. Elmagarmid, P. G. Ipeirotis and V. S. Verykios. Duplicate record detection: A survey. TKDE 19(1): 1-16, 2007

Question & Answer

Thanks!

Exercise 1. Use Extended Key to do Entity Identification [1]

Table R and S as shown below:

Table R
Name              City       ZIP     PersonNr
Eva Aadde         INGARÖ     13469   840126-1223
Eva Aalto         Norsborg   14564   851201-1225
Eva Abrahamsson   INGARÖ     13463   861227-1227

Table S
Name              HomeAddress       Telephone
Eva Aadde         Myskviksvägen 8   08-571 480 27
Eva Abrahamsson   Myrvägen 2        08-570 290 91
Eva Abrahamsson   Pilgatan 9        08-642 61 79
Eva Abrahamsson   Nyängsvägen 39A   08-530 356 44

Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs hold:
(E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
(E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
(E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
(E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")

Please construct the integrated table.

----------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson. Entity Identification in Database Integration. Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993.

Exercise 2. Use Priority Queue to do Duplicate Detection [2]

• Given the conditions below, please use the Priority Queue algorithm to find the duplicate clusters within.

1. Table R, which is already sorted according to an application-specific key:

Tuple: T1, T2, T3, T4, T5, T6, T7

2. Similarities between tuples:

Tuple   T1    T2    T3    T4    T5    T6    T7
T1      -     0.6   0.1   0.3   0.5   0.1   0.2
T2      0.6   -     0.2   0.4   0.4   0.4   0.2
T3      0.1   0.2   -     0.9   0.4   0.6   0.5
T4      0.3   0.4   0.9   -     0.4   0.6   0.6
T5      0.5   0.4   0.4   0.4   -     0.4   0.8
T6      0.1   0.4   0.6   0.6   0.4   -     0.4
T7      0.2   0.2   0.5   0.6   0.8   0.4   -

3. Method to count the Matching Score: given one cluster, the Matching Score of one tuple is the average of the tuple's similarity with all of the cluster's representatives.
4. The condition to declare membership in a cluster: matching score > 0.5 (otherwise a new cluster is declared)
5. The condition to declare a representative: 0.5 < matching score < 0.8
6. The size of the Priority Queue: 2

----------------------------------------------------
[2] A.E. Monge and C.P. Elkan. An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery and Data Mining, 1997.
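
The priority-queue procedure used in Exercise 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from [2]: the function name cluster_with_priority_queue, its parameter names, the tie-breaking rule (a tuple joins the queued cluster with the highest matching score) and the five-record similarity table are all hypothetical. The matching score is the average similarity to a cluster's representatives, as in the exercise.

```python
def cluster_with_priority_queue(records, sim, queue_size=2,
                                member_thr=0.5, rep_thr=0.8):
    """Group records into duplicate clusters with a fixed-size priority queue.

    records: record ids, already sorted by the application-specific key.
    sim:     function (a, b) -> similarity in [0, 1].
    """
    queue = []      # clusters still in the queue, highest priority first
    clusters = []   # every cluster ever created, in creation order
    for rec in records:
        # Compare the record only with the representatives of queued clusters.
        best, best_score = None, member_thr
        for c in queue:
            score = sum(sim(rec, r) for r in c["reps"]) / len(c["reps"])
            if score > best_score:  # assumption: the best-scoring cluster wins
                best, best_score = c, score
        if best is None:
            # The record belongs to no queued cluster: start a new one.
            best = {"members": [rec], "reps": [rec]}
            clusters.append(best)
        else:
            best["members"].append(rec)
            if best_score < rep_thr:
                # A weak match adds new information: keep it as a representative.
                best["reps"].append(rec)
            queue = [c for c in queue if c is not best]
        queue.insert(0, best)   # the updated cluster gets the highest priority
        del queue[queue_size:]  # the cluster with the least priority drops out
    return [c["members"] for c in clusters]


# Hypothetical similarity table (not the data of Exercise 2).
TABLE = {
    ("A", "B"): 0.9, ("A", "C"): 0.1, ("A", "D"): 0.6, ("A", "E"): 0.4,
    ("B", "C"): 0.2, ("B", "D"): 0.7, ("B", "E"): 0.5,
    ("C", "D"): 0.1, ("C", "E"): 0.2, ("D", "E"): 0.8,
}


def sim(a, b):
    """Symmetric lookup in the hypothetical table."""
    return TABLE[tuple(sorted((a, b)))]


print(cluster_with_priority_queue(list("ABCDE"), sim))
```

With a queue of size 2 the sketch groups A, B, D and E into one cluster and leaves C alone. With queue_size=1 the first cluster is pushed out when C arrives, so D and E can no longer join it and form a cluster of their own. This shows how the fixed queue size trades accuracy for fewer comparisons.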