Distinguishing Objects with Identical
Names in Relational Databases
Motivations
• Different objects may share the same name
– In AllMusic.com, 72 songs and 3 albums named
“Forgotten” or “The Forgotten”
– In DBLP, 141 papers are written by at least 14
different “Wei Wang”s
– How to distinguish the authors of the 141
papers?
(Figure: a sample of DBLP paper records whose author lists contain “Wei Wang” — e.g., “Wei Wang, Jiong Yang, Richard Muntz” (VLDB 1997), “Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu” (SIGMOD 2002), “Hongjun Lu, Yidong Yuan, Wei Wang, Xuemin Lin” (ICDE 2005), “Aidong Zhang, Yuqing Song, Wei Wang” (WWW 2003) — grouped into four real authors:
(1) Wei Wang at UNC
(2) Wei Wang at UNSW, Australia
(3) Wei Wang at Fudan Univ., China
(4) Wei Wang at SUNY Buffalo)
Challenges of Object Distinction
• Related to duplicate detection, but
– Textual similarity cannot be used
– Different references appear in different contexts
(e.g., different papers), and thus seldom share
common attributes
– Each reference is associated with limited
information
• We need to carefully design an approach and
use all information we have
Overview of DISTINCT
• Measure similarity between references
– Linkages between references
• Based on our empirical study, references to the same
object are more likely to be connected
– Neighbor tuples of each reference
• Can indicate similarity between their contexts
• Reference clustering
– Group references according to their similarities
Similarity 1: Link-based Similarity
• Indicates the overall strength of connections
between two references
– We use random walk probability between the two
tuples containing the references
– Random walk probabilities along different join
paths are handled separately
• Because different join paths have different semantic
meanings
• Only consider join paths of length at most 2L (L is the
number of steps of propagating probabilities)
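The propagation of random walk probabilities can be sketched as follows. This is a minimal illustration, not the paper's implementation: it uses a toy join graph (hypothetical node names inspired by the DBLP example) and splits probability evenly over a tuple's join edges at each step.

```python
from collections import defaultdict

# Toy join graph: each tuple links to the tuples it joins with.
# (Hypothetical mini-schema inspired by the DBLP example in these slides.)
graph = {
    "author:Wei Wang":   ["paper:vldb/wangym97"],
    "author:Jiong Yang": ["paper:vldb/wangym97"],
    "paper:vldb/wangym97": ["author:Wei Wang", "author:Jiong Yang",
                            "proc:vldb/vldb97"],
    "proc:vldb/vldb97":  ["paper:vldb/wangym97"],
}

def random_walk_probs(start, steps):
    """Probability of being at each tuple after `steps` uniform random joins."""
    probs = {start: 1.0}
    for _ in range(steps):
        nxt = defaultdict(float)
        for tup, p in probs.items():
            nbrs = graph.get(tup, [])
            for n in nbrs:
                nxt[n] += p / len(nbrs)   # split probability evenly over joins
        probs = dict(nxt)
    return probs

p = random_walk_probs("author:Wei Wang", 2)
# After 2 steps the walk is back at an author, or at the proceedings tuple.
```

In a real system the probabilities would be tracked per join path rather than pooled, since, as the slide notes, different join paths carry different semantics.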
Example of Random Walk
(Figure: random walk across the Publish, Authors, Publications, and Proceedings tables for paper vldb/wangym97, “STING: A Statistical Information Grid Approach to Spatial Data Mining” (Very Large Data Bases 1997, Athens, Greece). Starting from the reference “Wei Wang”, probability 1.0 flows to the paper’s tuple and is then split (0.5 each) along the join edges to coauthors Jiong Yang and Richard Muntz.)
Path Decomposition
• It is very expensive to propagate IDs and
probabilities along a long join path
• Divide each join path into two parts with equal (or
almost equal) length
– Compute the probability from each target object to each
tuple, and from each tuple to each target object
(Figure: a join path Rt – R1 – R2 split at its midpoint; each tuple stores a pair of probabilities (forward/backward, e.g. 0.2/0.5), with 1/1 at the origin of probability propagation and 0/0 at unreachable tuples.)
Path Decomposition (cont.)
• When computing probability of walking from
one target object to another
– Combine the forward and backward probabilities
(Figure: middle tuples t1–t4 between path halves R1 and R2; each middle tuple stores forward and backward probabilities per half, e.g. t1 carries R1: 0.1/0.25 and R2: 0.2/0.1.)
Probability of walking from O1 to O2 through t1 is 0.1*0.1 = 0.01
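The combination step can be sketched in a few lines. This is an illustrative calculation, not the paper's code: the t1 numbers match the slide (0.1 forward from O1, 0.1 backward to O2), while t2 and its probabilities are hypothetical.

```python
# Forward probabilities: walking from target object O1 to each middle tuple
# along the first half of the join path.
# Backward probabilities: walking from each middle tuple on to target object O2
# along the second half. (t1 values from the slide; t2 is hypothetical.)
forward  = {"t1": 0.1, "t2": 0.2}
backward = {"t1": 0.1, "t2": 0.25}

# Total probability of walking from O1 to O2 along this join path:
# sum over middle tuples of forward * backward (0.1*0.1 + 0.2*0.25).
prob_o1_to_o2 = sum(forward[t] * backward.get(t, 0.0) for t in forward)
```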
Similarity 2: Neighborhood Similarity
• Find the neighbor tuples of each reference
– Neighbor tuples within L joins
• Weights of neighbor tuples
– Different neighbor tuples have different degrees
of connection to the reference
– Assign each neighbor tuple a weight, which is
the probability of walking from the reference to
this tuple
• Similarity: Set resemblance between two
sets of neighbor tuples
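The weighted set resemblance described above can be sketched as a weighted-Jaccard computation. This is an assumption-laden sketch (the paper's exact resemblance formula may differ), with neighbor weights borrowed from the example on the next slide and a second, hypothetical reference for contrast.

```python
def weighted_resemblance(nbr_a, nbr_b):
    """Set resemblance between two weighted neighbor-tuple sets.
    Weights are random walk probabilities from each reference to the tuple.
    (Weighted-Jaccard sketch; the paper's exact formula may differ.)"""
    keys = set(nbr_a) | set(nbr_b)
    inter = sum(min(nbr_a.get(k, 0.0), nbr_b.get(k, 0.0)) for k in keys)
    union = sum(max(nbr_a.get(k, 0.0), nbr_b.get(k, 0.0)) for k in keys)
    return inter / union if union else 0.0

# Neighbor tuples of two "Wei Wang" references (weights as illustration):
ref1 = {"Jiong Yang": 0.2, "Richard Muntz": 0.2, "Jinze Liu": 0.1}
ref2 = {"Jiong Yang": 0.1, "Haixun Wang": 0.05}
sim = weighted_resemblance(ref1, ref2)   # shared neighbor: Jiong Yang
```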
Example of Neighbor Objects
(Figure: neighbor tuples of the reference “Wei Wang” in paper vldb/wangym97, weighted by random walk probability: Jiong Yang 0.2, Richard Muntz 0.2, Jinze Liu 0.1, Haixun Wang 0.05, …)
Automatically Build A Training Set
• Find objects with unique names
– For persons’ names
• A person’s name = first name + last name (+ middle name)
• A rare first name + a rare last name → the name is likely to be
unique (e.g., “Johannes Gehrke”)
– For other types of objects (paper titles, product names, …),
consider the IDF of each term in object names
• Use two references to an object with a unique name
as a positive example
• Use two references to different objects as a
negative example
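The rare-name heuristic for persons can be sketched as below. The author list and the threshold are hypothetical; the real source would be the full DBLP author table, and the paper's exact rarity criterion may differ.

```python
from collections import Counter

# Hypothetical author list; in the paper the source is DBLP.
authors = ["Wei Wang", "Wei Lin", "Johannes Gehrke", "Jiawei Han", "Wei Han"]

first = Counter(a.split()[0] for a in authors)    # first-name frequencies
last  = Counter(a.split()[-1] for a in authors)   # last-name frequencies

def likely_unique(name, max_count=1):
    """A name built from a rare first name and a rare last name is
    likely to refer to a single real person (threshold is assumed)."""
    f, l = name.split()[0], name.split()[-1]
    return first[f] <= max_count and last[l] <= max_count

# "Johannes Gehrke" passes; "Wei Wang" fails because "Wei" is common.
```

References to authors that pass this test can then be paired up as positive training examples, exactly as the bullet above describes.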
Selecting Important Join Paths
(Figure: training object pairs (A1, A2), (B1, B2), (C1, C2) connected via different join paths, e.g. through coauthor, conference, and year.)
• Each pair of training objects is connected with a
certain probability via each join path
• Construct a training set: each join path becomes one feature
• Use SVM to learn the weight of each join path
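The weight-learning step can be sketched with a linear classifier. This is a toy stand-in: a plain perceptron replaces the SVM used in the paper, and the join paths, feature values, and labels are all hypothetical.

```python
# Each training pair of references becomes a feature vector: its connection
# probability via each join path. Labels: +1 same object, -1 different.
# A tiny perceptron stands in for the SVM used in the paper.
join_paths = ["coauthor", "conference", "year"]

# Hypothetical training pairs (probability per join path, label):
data = [
    ([0.30, 0.10, 0.05], +1),
    ([0.25, 0.20, 0.10], +1),
    ([0.00, 0.05, 0.10], -1),
    ([0.01, 0.10, 0.15], -1),
]

w = [0.0] * len(join_paths)
b = 0.0
for _ in range(100):                      # perceptron epochs
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if y * score <= 0:                # misclassified: update
            w = [wi + y * xi for wi, xi in zip(w, x)]
            b += y

weights = dict(zip(join_paths, w))        # learned weight per join path
```

In this toy data the coauthor path best separates same-author pairs from different-author pairs, so it ends up with the largest weight, which mirrors the intuition that some join paths matter far more than others.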
Clustering References
• We choose agglomerative hierarchical
clustering because
– We do not know the number of clusters (real entities)
– We only know similarity between references
– Equivalent references can be merged into a
cluster, which represents a single entity
How to measure similarity between clusters?
• Single-link (highest similarity between points in two
clusters) ?
– No, because references to different objects can be
connected.
• Complete-link (lowest similarity between points in two
clusters)?
– No, because references to the same object may be
weakly connected.
• Average-link (average similarity between points in
two clusters)?
– A better measure
Problem with Average-link
(Figure: three clusters C1, C2, C3; C2 lies close to C1 but far from most of C1’s points.)
• C2 is close to C1, but their average similarity is low
• We use collective random walk probability: Probability of
walking from one cluster to another
• Final measure:
Average neighborhood similarity and Collective random
walk probability
Clustering Procedure
• Procedure
– Initialization: Use each reference as a cluster
– Keep finding and merging the most similar pair of
clusters
– Until no pair of clusters is similar enough
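The procedure above can be sketched as a greedy loop. This is a minimal illustration: the cluster-similarity function `sim` and the stopping threshold are assumed inputs, and the toy pairwise similarities are hypothetical.

```python
def agglomerative(refs, sim, threshold):
    """Greedy agglomerative clustering sketch: start with singleton clusters,
    repeatedly merge the most similar pair until no pair reaches `threshold`.
    `sim(c1, c2)` scores two clusters (e.g., average-link); assumed given."""
    clusters = [[r] for r in refs]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sim(clusters[i], clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break                        # no pair of clusters is similar enough
        i, j = pair
        clusters[i] += clusters.pop(j)   # merge the most similar pair
    return clusters

# Toy run with average-link over hypothetical reference similarities:
pairwise = {frozenset({"a", "b"}): 0.9, frozenset({"a", "c"}): 0.1,
            frozenset({"b", "c"}): 0.2}
avg = lambda c1, c2: sum(pairwise[frozenset({x, y})]
                         for x in c1 for y in c2) / (len(c1) * len(c2))
result = agglomerative(["a", "b", "c"], avg, 0.5)
```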
Efficient Computation
• In agglomerative hierarchical clustering, one needs
to repeatedly compute similarity between clusters
– When merging clusters C1 and C2 into C3, we need to
compute the similarity between C3 and any other cluster
– Very expensive when clusters are large
• We develop methods to compute similarity
incrementally
– Neighborhood similarity
– Random walk probability
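For average-link-style measures, the incremental idea can be shown in one line: when C1 and C2 merge into C3, the similarity between C3 and any other cluster C is a size-weighted mean of the old similarities, so it needs no re-scan of members. This is the standard incremental trick, shown here as an illustration; the paper's own incremental formulas for neighborhood similarity and random walk probability are analogous in spirit but not reproduced here.

```python
def merged_avg_sim(sim_c1_c, n1, sim_c2_c, n2):
    """Average-link similarity between the merged cluster C3 = C1 ∪ C2
    and another cluster C, computed in O(1) from the old similarities.
    n1, n2 are the sizes of C1 and C2."""
    return (n1 * sim_c1_c + n2 * sim_c2_c) / (n1 + n2)

# Check: C1 = {a, b}, C2 = {c}; average sims to C were 0.25 and 0.55,
# so the merged average over all member pairs is (2*0.25 + 1*0.55) / 3.
merged = merged_avg_sim(0.25, 2, 0.55, 1)
```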
Experimental Results
• Distinguishing references to authors in DBLP
– Accuracy of reference clustering
• True positive: Number of pairs of references to same
author in same cluster
• False positive: Different authors, in same cluster
• False negative: Same author, different clusters
• True negative: Different authors, different clusters
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
f-measure = 2*precision*recall / (precision+recall)
Accuracy = TP/(TP+FP+FN)
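The pairwise metrics defined above can be computed as follows; the toy references and labels are hypothetical.

```python
from itertools import combinations

def pairwise_metrics(pred_clusters, true_labels):
    """Precision, recall, f-measure, and accuracy over pairs of references,
    as defined on the slide (accuracy is the Jaccard-style TP/(TP+FP+FN))."""
    cluster_of = {r: i for i, c in enumerate(pred_clusters) for r in c}
    tp = fp = fn = 0
    for a, b in combinations(true_labels, 2):
        same_true = true_labels[a] == true_labels[b]
        same_pred = cluster_of[a] == cluster_of[b]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall    = tp / (tp + fn) if tp + fn else 1.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    accuracy  = tp / (tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, f, accuracy

# Toy check: r1, r2, r3 refer to the same author A; r4 to another author B.
pred = [["r1", "r2"], ["r3", "r4"]]
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
p, r, f, acc = pairwise_metrics(pred, truth)
```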
Accuracy on Synthetic Tests
• Select 1000 authors with at least 5 papers
• Merge G (G=2 to 10) authors into one group
• Use DISTINCT to distinguish each group of references
(Figure: precision–recall curves and accuracy vs. the minimum-similarity threshold (min-sim, from 1e-10 to 0.005), for group sizes G2, G4, G6, G8, G10.)
Compare with “Existing Approaches”
• Random walk and neighborhood similarity have
been used in duplicate detection
• We combine them with our clustering approaches
for comparison
(Figure: max f-measure and max accuracy vs. group size (2 to 10), comparing DISTINCT with unsupervised set resemblance, unsupervised random walk, and the combined measure.)
Real Cases
Name                #author  #ref  accuracy  precision  recall  f-measure
Hui Fang                  3     9  1.0       1.0        1.0     1.0
Ajay Gupta                4    16  1.0       1.0        1.0     1.0
Joseph Hellerstein        2   151  0.81      1.0        0.81    0.895
Rakesh Kumar              2    36  1.0       1.0        1.0     1.0
Michael Wagner            5    29  0.395     1.0        0.395   0.566
Bing Liu                  6    89  0.825     1.0        0.825   0.904
Jim Smith                 3    19  0.829     0.888      0.926   0.906
Lei Wang                 13    55  0.863     0.92       0.932   0.926
Wei Wang                 14   141  0.716     0.855      0.814   0.834
Bin Yu                    5    44  0.658     1.0        0.658   0.794
average                            0.81      0.966      0.836   0.883
Real Cases: Comparison
(Figure: accuracy and f-measure on the real cases (y-axis from 0.4 to 1), comparing DISTINCT with supervised set resemblance, supervised random walk, the unsupervised combined measure, unsupervised set resemblance, and unsupervised random walk.)
Distinguishing Different “Wei Wang”s
(Figure: clusters of “Wei Wang” references found by DISTINCT, labeled by affiliation and size: UNC-CH (57), Fudan U, China (31), UNSW, Australia (19), Harbin U, China (5), SUNY Buffalo (5), NU Singapore (5), Beijing U Com, China (5), Zhejiang U, China (3), Nanjing Normal, China (3), Beijing Polytech (3), SUNY Binghamton (2), Ningbo Tech, China (2), Purdue (2), Chongqing U, China (2); numbers on the edges between clusters count references confused between them.)
Scalability
• Agglomerative hierarchical clustering takes
quadratic time
– Because it requires computing pair-wise similarity
(Figure: running time vs. number of references; time grows from near 0 to about 16 seconds as the number of references grows to 350.)
Thank you!
Self-loop Phenomenon
• In a relational database, each object connects to
other objects via different join paths
– In DBLP: author → paper → conference → …
• An object usually connects to itself more
frequently than most other objects of same type
– E.g. an author in DBLP usually connects to himself
more frequently than to other authors
(Figure: an author, e.g. Charu Aggarwal, connects back to herself via KDD04 → KDD, and via TKDE vol. 15 and vol. 16 → TKDE as associate editor.)
Self-loops: connect to herself via proceedings, conferences, areas, …
An Experiment
• Consider a relational database as a graph
– Use DBLP as an example
• Randomly select an author x
– Perform random walks starting from x
• With 8 steps
– For each author y, calculate probability of reaching y
from x
• Not considering paths like “x→a→ … →a→x”
– Rank all authors by the probability
• Perform the above procedure for 500 authors (each
with at least 10 papers) randomly selected from the
31,608 authors
Results
• For each selected author x, rank all authors
by random walk probability from x
• Rank of x himself
Rank of x within:   Portion of the 500 authors with such “self-rank”
Top 0.1%            62%
Top 1%              86%
Top 10%             99%
Top 50%             100%
Results (curves)
(Figure: portion of tuples vs. self-rank, on the DBLP database (self-rank from 0.01% to 50%) and a CS Dept database (1% to 100%), comparing ranking by number of paths (#path) with ranking by random walk probability.)