Distinguishing Objects with Identical Names in Relational Databases

Motivations
• Different objects may share the same name
  – On AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
  – In DBLP, 141 papers are written by at least 14 different authors named "Wei Wang"
  – How can we distinguish the authors of these 141 papers?

[Figure: a graph of papers written by four different authors named "Wei Wang", with their co-authors and venues (VLDB, SIGMOD, ICDE, KDD, ICDM, CIKM, CSB, ADMA, WWW)]
  (1) Wei Wang at UNC                (3) Wei Wang at Fudan Univ., China
  (2) Wei Wang at UNSW, Australia    (4) Wei Wang at SUNY Buffalo

Challenges of Object Distinction
• Related to duplicate detection, but
  – Textual similarity cannot be used
  – Different references appear in different contexts (e.g., different papers), and thus seldom share common attributes
  – Each reference is associated with limited information
• We need to carefully design an approach that uses all the information we have

Overview of DISTINCT
• Measure similarity between references
  – Linkages between references
    • Our empirical study shows that references to the same object are more likely to be connected
  – Neighbor tuples of each reference
    • Can indicate similarity between their contexts
• Reference clustering
  – Group references according to their similarities

Similarity 1: Link-based Similarity
• Indicates the overall strength of the connections between two references
  – We use the random walk probability between the two tuples
containing the references
  – Random walk probabilities along different join paths are handled separately
    • Because different join paths have different semantic meanings
  – Only join paths of length at most 2L are considered (L is the number of steps of probability propagation)

Example of Random Walk
[Figure: random walk probabilities propagated from the paper vldb/wangym97 ("STING: A Statistical Information Grid Approach to Spatial Data Mining", VLDB 1997, Athens, Greece) to its authors Wei Wang, Jiong Yang, and Richard Muntz through the Publish, Authors, and Publications relations]

Path Decomposition
• It is very expensive to propagate IDs and probabilities along a long join path
• Divide each join path into two parts of equal (or almost equal) length
  – Compute the probability from each target object to each tuple, and from each tuple to each target object
[Figure: forward/backward probability pairs propagated from the origin of probability propagation across relations Rt, R1, and R2]

Path Decomposition (cont.)
• When computing the probability of walking from one target object to another
  – Combine the forward and backward probabilities
[Figure: tuples t1-t4 between relations R1 and R2, annotated with forward/backward probabilities such as 0.1/0.25 for R1 and 0.2/0.1 for R2; the probability of walking from O1 to O2 through t1 is 0.1 * 0.1 = 0.01]

Similarity 2: Neighborhood Similarity
• Find the neighbor tuples of each reference
  – Neighbor tuples within L joins
• Weights of neighbor tuples
  – Different neighbor tuples have connections of different strength to the reference
  – Assign each neighbor tuple a weight: the probability of walking from the reference to this tuple
• Similarity: set resemblance between the two sets of neighbor tuples

Example of Neighbor Objects
[Figure: neighbor authors of the reference "Wei Wang" in paper vldb/wangym97, with weights such as Jiong Yang 0.2, Richard Muntz 0.2, Jinze Liu 0.1, Haixun Wang 0.05]

Automatically Building a Training Set
• Find objects with unique names
  – For persons' names:
    • A person's name = first name + last name (+ middle name)
    • A rare first name + a rare last name → the name is likely to be unique (e.g., Johannes Gehrke)
  – For other types of objects (paper titles, product names, ...), consider the IDF of each term in the object names
• Use two references to an object with a unique name as a positive example
• Use two references to different objects as a negative example

Selecting Important Join Paths
[Figure: pairs of training objects connected via different join paths, e.g., through a coauthor, or through a conference and year]
• Each pair of training objects is connected with a certain probability via each join path
• Construct the training set: join path → feature
• Use an SVM to learn the weight of each join path

Clustering References
• We choose agglomerative hierarchical clustering because
  – We do not know the number of clusters (real entities)
  – We only know the similarity between references
  – Equivalent references can be merged into a cluster, which represents a single entity

How to measure similarity between clusters?
• Single-link (highest similarity between points in the two clusters)?
  – No, because references to different objects can be connected
• Complete-link (minimum similarity between points in the two clusters)?
  – No, because references to the same object may be weakly connected
• Average-link (average similarity between points in the two clusters)?
  – A better measure

Problem with Average-link
[Figure: clusters C1, C2, C3; C2 is close to C1, but their average similarity is low]
• We use the collective random walk probability: the probability of walking from one cluster to another
• Final measure: average neighborhood similarity and collective random walk probability

Clustering Procedure
• Initialization: use each reference as a cluster
• Keep finding and merging the most similar pair of clusters
• Until no pair of clusters is similar enough

Efficient Computation
• In agglomerative hierarchical clustering, one needs to repeatedly compute similarity between clusters
  – When merging clusters C1 and C2 into C3, we need to compute the similarity between C3 and every other cluster
  – Very expensive when clusters are large
• We develop methods to compute the similarities incrementally
  – Neighborhood similarity
  – Random walk probability

Experimental Results
• Distinguishing references to authors in DBLP
• Accuracy of reference clustering
  – True positive (TP): pairs of references to the same author, in the same cluster
  – False positive (FP): different authors, same cluster
  – False negative (FN): same author, different clusters
  – True negative (TN): different authors, different clusters
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• f-measure = 2 * precision * recall / (precision + recall)
• Accuracy = TP / (TP + FP + FN)

Accuracy on Synthetic Tests
• Select 1000 authors with at least 5 papers
• Merge G (G = 2 to 10) authors into one group
• Use DISTINCT to distinguish each group of references
[Figures: accuracy vs. min-sim threshold (1e-10 to 0.005) and precision-recall curves, for group sizes G = 2, 4, 6, 8, 10]

Compare with "Existing Approaches"
• Random walk and
neighborhood similarity have been used in duplicate detection
• We combine them with our clustering approaches for comparison
[Figures: max f-measure and max accuracy vs. group size (2 to 10) for DISTINCT, unsupervised set resemblance, unsupervised random walk, and the combined measure]

Real Cases

Name                #author  #ref  accuracy  precision  recall  f-measure
Hui Fang                  3     9     1.0       1.0      1.0      1.0
Ajay Gupta                4    16     1.0       1.0      1.0      1.0
Joseph Hellerstein        2   151     0.81      1.0      0.81     0.895
Rakesh Kumar              2    36     1.0       1.0      1.0      1.0
Michael Wagner            5    29     0.395     1.0      0.395    0.566
Bing Liu                  6    89     0.825     1.0      0.825    0.904
Jim Smith                 3    19     0.829     0.888    0.926    0.906
Lei Wang                 13    55     0.863     0.92     0.932    0.926
Wei Wang                 14   141     0.716     0.855    0.814    0.834
Bin Yu                    5    44     0.658     1.0      0.658    0.794
average                               0.81      0.966    0.836    0.883

Real Cases: Comparison
[Figure: accuracy and f-measure of DISTINCT vs. supervised set resemblance, supervised random walk, unsupervised combined measure, unsupervised set resemblance, and unsupervised random walk]

Distinguishing Different "Wei Wang"s
[Figure: clusters of "Wei Wang" references by affiliation, with a few references linked across clusters]
• UNC-CH (57), Fudan U China (31), UNSW Australia (19), Harbin U China (5), SUNY Buffalo (5), NU Singapore (5), Beijing U Com China, Zhejiang U China (3), Nanjing Normal China (3), Beijing Polytech (3), SUNY Binghamton (2), Ningbo Tech China (2), Purdue (2), Chongqing U China (2)

Scalability
• Agglomerative hierarchical clustering takes quadratic time
  – Because it requires computing pair-wise similarity
[Figure: running time (seconds) vs. number of references (up to 350)]

Thank you!

Self-loop Phenomenon
• In a relational database, each object connects to other objects via different join paths
  – In DBLP: author → paper → conference → ...
• An object usually connects to itself more frequently than to most other objects of the same type
  – E.g.,
an author in DBLP usually connects to himself more frequently than to other authors
[Figure: Charu Aggarwal connects to herself via papers (KDD04, TKDE vol. 15, TKDE vol. 16), the KDD conference, the TKDE journal, and an associate-editor role. Self-loops: an author connects to herself via proceedings, conferences, areas, ...]

An Experiment
• Consider a relational database as a graph
  – We use DBLP as an example
• Randomly select an author x
  – Perform random walks of 8 steps starting from x
  – For each author y, calculate the probability of reaching y from x
    • Paths like "x → a → ... → a → x" are not considered
  – Rank all authors by this probability
• Perform the above procedure for 500 authors (each with at least 10 papers) randomly selected from 31,608 authors

Results
• For each selected author x, rank all authors by the random walk probability from x
• Rank of x himself ("self-rank"):

Rank of x within   Portion of the 500 authors with such self-rank
Top 0.1%           62%
Top 1%             86%
Top 10%            99%
Top 50%            100%

Results (curves)
[Figures: portion of tuples vs. self-rank for the #path and random walk measures, on the DBLP database and on a CS Dept database]
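Both the link-based similarity and the self-rank experiment above rest on propagating random-walk probabilities through the database graph. Below is a minimal sketch of that propagation on a toy author-paper graph; the graph, node names (a1, a2, b, c, d, p1-p3), step count, and the function `random_walk_probs` are illustrative assumptions, not DBLP data or the authors' actual implementation.

```python
from collections import defaultdict

def random_walk_probs(adj, start, steps):
    """Uniform random walk: at each step, the probability mass at a node
    is split evenly among its neighbors. Returns, for every node reached,
    the total probability of visiting it over the `steps` steps."""
    dist = {start: 1.0}
    reach = defaultdict(float)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            nbrs = adj.get(node, [])
            if not nbrs:
                continue
            for n in nbrs:
                nxt[n] += p / len(nbrs)
        dist = nxt
        for node, p in dist.items():
            reach[node] += p
    return dict(reach)

# Toy bipartite author-paper graph (made-up identifiers): the two
# references a1 and a2 share the co-author b, while c and d are unrelated.
adj = {
    "a1": ["p1"], "a2": ["p2"], "b": ["p1", "p2"],
    "c": ["p3"], "d": ["p3"],
    "p1": ["a1", "b"], "p2": ["a2", "b"], "p3": ["c", "d"],
}

probs = random_walk_probs(adj, "a1", steps=4)
# a2 is reachable from a1 (through the shared co-author b), the unrelated
# author c is not, and a1 reaches itself with the highest probability
# among authors: the self-loop phenomenon the slides describe.
```

In DISTINCT the walk runs separately per join path over database tuples, and the per-path probabilities are weighted by an SVM; this sketch collapses all of that into a single uniform walk to show the propagation step alone.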