Duplicate Detection

Exercise 1. Use an Extended Key to Do Entity Identification [1]

• Tables R and S are shown below.

Table R
Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S
Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44

• Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs (instance-level functional dependencies) hold:
  – (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
  – (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
  – (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
  – (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
• Please construct the integrated table.

----------------------------------------------------
[1] Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p. 294-301, April 19-23, 1993

Answer

• The ILFDs supply the missing City value for each tuple of S. Tuples of R and S that then agree on Name and City are merged; attributes with no source value are set to NULL.

• Integrated Table
Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44

Exercise 2. Use a Priority Queue to Do Duplicate Detection [2]

• Given the conditions below, use the priority queue algorithm to find the duplicate clusters.

1. Table R, already sorted by an application-specific key, holds the tuples T1 to T7.
2. Similarities between tuples:

Tuple  T1   T2   T3   T4   T5   T6   T7
T1     -    0.6  0.1  0.3  0.5  0.1  0.2
T2     0.6  -    0.2  0.4  0.4  0.4  0.2
T3     0.1  0.2  -    0.9  0.4  0.6  0.5
T4     0.3  0.4  0.9  -    0.4  0.6  0.6
T5     0.5  0.4  0.4  0.4  -    0.4  0.8
T6     0.1  0.4  0.6  0.6  0.4  -    0.4
T7     0.2  0.2  0.5  0.6  0.8  0.4  -
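3. Matching score: for a given cluster, the matching score of a tuple is the average of its similarities to all of the cluster's representatives.
4. Condition for declaring a new cluster: matching score < 0.5 for every cluster in the queue.
5. Condition for declaring a representative: 0.5 < matching score < 0.8.
6. Size of the priority queue: 2.

----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997

Answer

In the steps below, i:j denotes the similarity between tuples Ti and Tj. Clusters are listed with the highest-priority cluster first.

Record 1
  The queue is empty, so record 1 starts a new cluster.
  Queue: {1}

Record 2
  2:1 = 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster and becomes a representative.
  Queue: {1, 2}

Record 3
  3:1 = 0.1, 3:2 = 0.2
  Matching score with {1, 2} = (0.1 + 0.2) / 2 = 0.15 < 0.5, so record 3 starts a new cluster.
  Queue: {3} {1, 2}

Record 4
  4:1 = 0.3, 4:2 = 0.4
  Matching score with {1, 2} = (0.3 + 0.4) / 2 = 0.35 < 0.5
  4:3 = 0.9, which is > 0.5 and > 0.8, so record 4 joins {3} but does not become a representative.
  Queue: {3, 4} {1, 2}

Record 5
  5:1 = 0.5, 5:2 = 0.4
  Matching score with {1, 2} = (0.5 + 0.4) / 2 = 0.45 < 0.5
  5:3 = 0.4
  Matching score with {3, 4} = 0.4 < 0.5 (3 is the only representative), so record 5 starts a new cluster.
  Queue: {5} {3, 4}
  The queue holds only 2 clusters, so {1, 2} leaves the queue and is not compared again.

Record 6
  6:3 = 0.6
  Matching score with {3, 4} = 0.6, which is > 0.5 and < 0.8, so record 6 joins {3, 4} and becomes a representative.
  6:5 = 0.4 < 0.5
  Queue: {3, 4, 6} {5}

Record 7
  7:3 = 0.5, 7:6 = 0.4
  Matching score with {3, 4, 6} = (0.5 + 0.4) / 2 = 0.45 < 0.5
  7:5 = 0.8 > 0.5, so record 7 joins {5}; because 0.8 is not < 0.8, it does not become a representative.
  Queue: {5, 7} {3, 4, 6}

Resulting duplicate clusters: {1, 2}, {3, 4, 6}, {5, 7}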
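The hand trace for Exercise 2 can be checked with a short script. The following is a minimal Python sketch of the procedure as the exercise states it, not Monge and Elkan's original implementation. The function name `find_duplicate_clusters` and its parameter names are hypothetical. It also rests on a few assumptions the exercise leaves open: clusters are scanned in priority order and a tuple joins the first cluster whose matching score is at least 0.5, the matched cluster moves to the front of the queue, and a score of exactly 0.5 or of 0.8 and above adds a member without adding a representative. The diagonal entries of the matrix are placeholders that are never read.

```python
# Similarity matrix from Exercise 2 (symmetric; index 0 is T1).
# Diagonal values are placeholders (assumption); they are never used.
SIM = [
    [1.0, 0.6, 0.1, 0.3, 0.5, 0.1, 0.2],
    [0.6, 1.0, 0.2, 0.4, 0.4, 0.4, 0.2],
    [0.1, 0.2, 1.0, 0.9, 0.4, 0.6, 0.5],
    [0.3, 0.4, 0.9, 1.0, 0.4, 0.6, 0.6],
    [0.5, 0.4, 0.4, 0.4, 1.0, 0.4, 0.8],
    [0.1, 0.4, 0.6, 0.6, 0.4, 1.0, 0.4],
    [0.2, 0.2, 0.5, 0.6, 0.8, 0.4, 1.0],
]


def find_duplicate_clusters(sim, match_min=0.5, rep_max=0.8, queue_size=2):
    """Cluster tuples 0..n-1 in order, using a bounded priority queue."""
    queue = []     # clusters still in the queue, highest priority first
    evicted = []   # clusters pushed out of the queue; they are final

    for t in range(len(sim)):
        for pos, cluster in enumerate(queue):
            reps = cluster["reps"]
            # Matching score: average similarity to the representatives.
            score = sum(sim[t][r] for r in reps) / len(reps)
            if score >= match_min:
                cluster["members"].append(t)
                # Only a mid-range score makes the tuple a representative.
                if match_min < score < rep_max:
                    reps.append(t)
                # The matched cluster gets the highest priority.
                queue.insert(0, queue.pop(pos))
                break
        else:
            # No cluster in the queue matched: start a new cluster.
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())  # drop the lowest priority

    # Report 1-based tuple numbers, ordered by each cluster's first member.
    return sorted([m + 1 for m in c["members"]] for c in evicted + queue)


print(find_duplicate_clusters(SIM))
```

With the exercise's matrix and thresholds, the script should produce the same three clusters as the worked answer. Changing `queue_size` to 1 is a quick way to see how a smaller queue limits which clusters a new tuple can still be compared against.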