Duplicate Detection

Exercise 1.
Use an extended key to do entity identification [1]
• Tables R and S are shown below:
Table R
Name            | City     | ZIP   | PersonNr
Eva Aadde       | INGARÖ   | 13469 | 840126-1223
Eva Aalto       | Norsborg | 14564 | 851201-1225
Eva Abrahamsson | INGARÖ   | 13463 | 861227-1227

Table S
Name            | HomeAddress     | Telephone
Eva Aadde       | Myskviksvägen 8 | 08-571 480 27
Eva Abrahamsson | Myrvägen 2      | 08-570 290 91
Eva Abrahamsson | Pilgatan 9      | 08-642 61 79
Eva Abrahamsson | Nyängsvägen 39A | 08-530 356 44
• Suppose the extended key is {Name, City, HomeAddress} and the following instance-level functional dependencies (ILFDs) hold:
– (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
– (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
– (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
• Please construct the integrated table.
----------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p. 294-301, April 19-23, 1993
Answer to Exercise 1
• Integrated Table
Name            | City      | ZIP   | PersonNr    | HomeAddress     | Telephone
Eva Aadde       | INGARÖ    | 13469 | 840126-1223 | Myskviksvägen 8 | 08-571 480 27
Eva Abrahamsson | INGARÖ    | 13463 | 861227-1227 | Myrvägen 2      | 08-570 290 91
Eva Abrahamsson | STOCKHOLM | NULL  | NULL        | Pilgatan 9      | 08-642 61 79
Eva Abrahamsson | TULLINGE  | NULL  | NULL        | Nyängsvägen 39A | 08-530 356 44
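The construction of the integrated table can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure from [1]: the names `R`, `S`, `ILFD_CITY` and `integrate` are hypothetical, the ILFDs are stored as a plain lookup from HomeAddress to City, and the function emits one row per tuple of S, as the answer table does.

```python
# Sketch only: R, S, ILFD_CITY and integrate() are illustrative names.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]
# The four ILFDs of the exercise: HomeAddress determines City.
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfd_city):
    """Extend each S tuple with a derived City, then match R on (Name, City)."""
    result = []
    for s in s_rows:
        city = ilfd_city.get(s["HomeAddress"])  # missing key attribute derived via ILFD
        match = next(
            (r for r in r_rows if r["Name"] == s["Name"] and r["City"] == city),
            None,
        )
        result.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": match["ZIP"] if match else None,            # None stands for NULL
            "PersonNr": match["PersonNr"] if match else None,
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return result


for row in integrate(R, S, ILFD_CITY):
    print(row)
```

Because R carries no HomeAddress, a pair of tuples is treated as the same entity when Name and the ILFD-derived City agree. S tuples without such a partner keep NULL for ZIP and PersonNr.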
Exercise 2.
Use a priority queue to do duplicate detection [2]
• Given the conditions below, use the priority queue algorithm to find the duplicate clusters.
1. Table R, with tuples T1 to T7, is already sorted according to an application-specific key.
2. Similarities between tuples:
Tuple | T1  | T2  | T3  | T4  | T5  | T6  | T7
T1    |  -  | 0.6 | 0.1 | 0.3 | 0.5 | 0.1 | 0.2
T2    | 0.6 |  -  | 0.2 | 0.4 | 0.4 | 0.4 | 0.2
T3    | 0.1 | 0.2 |  -  | 0.9 | 0.4 | 0.6 | 0.5
T4    | 0.3 | 0.4 | 0.9 |  -  | 0.4 | 0.6 | 0.6
T5    | 0.5 | 0.4 | 0.4 | 0.4 |  -  | 0.4 | 0.8
T6    | 0.1 | 0.4 | 0.6 | 0.6 | 0.4 |  -  | 0.4
T7    | 0.2 | 0.2 | 0.5 | 0.6 | 0.8 | 0.4 |  -
3. Method to compute the matching score: given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
4. Condition to declare a new cluster: matching score < 0.5
5. Condition to declare a representative: 0.5 < matching score < 0.8
6. Size of the priority queue: 2
----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
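The matching score and the two thresholds can be written down as a short Python sketch. The names `SIM`, `sim`, `matching_score` and `decide` are hypothetical helpers introduced here, not part of [2]. One assumption is made: the exercise does not say what happens at a score of exactly 0.5, and the sketch treats it as a match.

```python
# Sketch only: pairwise similarities from the exercise, keyed by (smaller id, larger id).
SIM = {
    (1, 2): 0.6, (1, 3): 0.1, (1, 4): 0.3, (1, 5): 0.5, (1, 6): 0.1, (1, 7): 0.2,
    (2, 3): 0.2, (2, 4): 0.4, (2, 5): 0.4, (2, 6): 0.4, (2, 7): 0.2,
    (3, 4): 0.9, (3, 5): 0.4, (3, 6): 0.6, (3, 7): 0.5,
    (4, 5): 0.4, (4, 6): 0.6, (4, 7): 0.6,
    (5, 6): 0.4, (5, 7): 0.8,
    (6, 7): 0.4,
}


def sim(a, b):
    """Symmetric lookup of the similarity between tuples a and b."""
    return SIM[(min(a, b), max(a, b))]


def matching_score(tuple_id, representatives):
    """Average similarity of a tuple to all representatives of one cluster."""
    return sum(sim(tuple_id, r) for r in representatives) / len(representatives)


def decide(score, match_at=0.5, representative_below=0.8):
    """Classify a tuple against one cluster (score == 0.5 counts as a match: assumption)."""
    if score < match_at:
        return "no match"                   # a new cluster is declared if no cluster matches
    if score < representative_below:
        return "member and representative"
    return "member only"


print(decide(matching_score(3, [1, 2])))  # tuple 3 against representatives 1 and 2
print(decide(matching_score(4, [3])))     # tuple 4 against representative 3
```

A tuple that matches very closely (score of 0.8 or more) adds little new information about the cluster, which is why it joins without becoming a representative.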
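Answer to Exercise 2
Record 1
No cluster exists yet: new cluster {1}
Queue {1}
Record 2
2:1 = 0.6, matching score = 0.6, > 0.5 and < 0.8: joins {1} as a representative
Queue {1, 2}
Record 3
3:1 = 0.1, 3:2 = 0.2, matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5: new cluster
Queue {3} {1, 2}
Record 4
4:1 = 0.3, 4:2 = 0.4, matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5: no match with {1, 2}
4:3 = 0.9, matching score = 0.9, > 0.5 and > 0.8: joins {3}, not a representative
Queue {3, 4} {1, 2}
Record 5
5:1 = 0.5, 5:2 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5: no match with {1, 2}
5:3 = 0.4, matching score = 0.4 < 0.5: no match with {3, 4}, new cluster
Queue {5} {3, 4}
Outside the queue (size 2): {1, 2}
Record 6
6:3 = 0.6, matching score = 0.6, > 0.5 and < 0.8: joins {3, 4} as a representative
6:5 = 0.4 < 0.5: no match with {5}
Queue {3, 4, 6} {5}
Outside the queue: {1, 2}
Record 7
7:3 = 0.5, 7:6 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5: no match with {3, 4, 6}
7:5 = 0.8, matching score = 0.8 > 0.5: joins {5}; 0.8 is not below 0.8, so not a representative
Queue {5, 7} {3, 4, 6}
Outside the queue: {1, 2}
Duplicate clusters: {1, 2}, {3, 4, 6}, {5, 7}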
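The trace can be replayed with a small Python sketch. This is a simplified illustration under assumptions, not the full algorithm of [2]: the bounded priority queue is modeled as a plain list whose front is the most recently touched cluster, a score of exactly 0.5 counts as a match, and the names `SIM`, `sim`, `matching_score` and `detect_clusters` are hypothetical.

```python
# Sketch only: pairwise similarities from the exercise, keyed by (smaller id, larger id).
SIM = {
    (1, 2): 0.6, (1, 3): 0.1, (1, 4): 0.3, (1, 5): 0.5, (1, 6): 0.1, (1, 7): 0.2,
    (2, 3): 0.2, (2, 4): 0.4, (2, 5): 0.4, (2, 6): 0.4, (2, 7): 0.2,
    (3, 4): 0.9, (3, 5): 0.4, (3, 6): 0.6, (3, 7): 0.5,
    (4, 5): 0.4, (4, 6): 0.6, (4, 7): 0.6,
    (5, 6): 0.4, (5, 7): 0.8,
    (6, 7): 0.4,
}


def sim(a, b):
    return SIM[(min(a, b), max(a, b))]


def matching_score(tuple_id, representatives):
    """Average similarity of a tuple to all representatives of one cluster."""
    return sum(sim(tuple_id, r) for r in representatives) / len(representatives)


def detect_clusters(order, queue_size=2, match_at=0.5, rep_below=0.8):
    queue = []    # index 0 holds the highest-priority cluster
    evicted = []  # clusters pushed out of the bounded queue
    for t in order:
        hit = None
        for cluster in queue:  # scan clusters from highest to lowest priority
            score = matching_score(t, cluster["reps"])
            if score >= match_at:
                hit = cluster
                break
        if hit is None:
            # No cluster in the queue matches: declare a new cluster.
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())  # drop the lowest-priority cluster
        else:
            hit["members"].append(t)
            if score < rep_below:
                hit["reps"].append(t)  # moderately similar: keep as a representative
            queue.remove(hit)
            queue.insert(0, hit)       # the matched cluster moves to the front
    return [c["members"] for c in queue + evicted]


print(detect_clusters(range(1, 8)))
```

The sketch scans the queue front to back, so for some records it visits clusters in a different order than the slides do. With the similarities of this exercise at most one cluster matches each record, so the resulting clusters agree with the trace.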