Linking Records with
Erroneous Values
Songtao Guo, Xin Luna Dong,
Divesh Srivastava, and Remi Zajac
AT&T Labs
1
Motivation
Src  Name                 Phone       Address                 City
V    A-Link Wireless      8185491449  2148 GLENDALE GALLERIA  GLENDALE
V    Abercrombie          8185020728  2229 GLENDALE GALLERIA  GLENDALE
V    Abercrombie & Fitch  8185507492  2151 GLENDALE GALLERIA  GLENDALE
V    Aeropostale          8185458972  2187 GLENDALE GALLERIA  GLENDALE
V    Aerosoles            8182462455  1163 GLENDALE GALLERIA  GLENDALE
V
2034266114
65 Church hill Rd
NEWTOWN
Src
Newtown Pizza Palace
Pizza Palace Of
Newtown
Name
2034266114
65 Church hill Rd
NEWTOWN
D
D
Aerosoles
Aldo Shoes
D
Newtown Pizza Palace
V
D
Cleaned
Data
Search
Box
Phone
Address
City
8182462455 1163 GLENDALE GALLERIA GLENDALE
8184090612 1157 GLENDALE GALLERIA GLENDALE
2034266114
Pizza Palace of Newtown 2034266114
65 Church hill Rd
Newtown
Church Hill Rd
Newtown
Src  Name                       Phone       Address                 City
A    A 24 Hour 1 A 1 Locksmith  8182404644  3210 GLENDALE GALLERIA  GLENDALE
A    A Link Wireless            8185491449  2148 GLENDALE GALLERIA  GLENDALE
A    Abercrombie                8185020728  2229 GLENDALE GALLERIA  GLENDALE
A    Abercrombie & Fitch        8185507492  2151 GLENDALE GALLERIA  GLENDALE
A    Newtown Pizza Palace       2034266114  65 Church hill Rd       Newtown
A    Aldo Shoes                 8185482540  2154 GLENDALE GALLERIA  GLENDALE
A    Alert Cellular             8182404779  2148 GLENDALE GALLERIA  GLENDALE
Src  Name                       Phone       Address                 City
T    Newtown Pizza Palace       2034266114  65 Church hill Rd       Newtown
T    Aldo Shoes                 8185482540  2154 GLENDALE GALLERIA  GLENDALE
T    American Eagle Outfitters  8189561893  2182 GLENDALE GALLERIA  GLENDALE
T    ANN TAYLOR                 8182460350  2178 GLENDALE GALLERIA  GLENDALE
T    Ann Taylor Stores          8182460350  1108 GLENDALE GALLERIA  GLENDALE
2
Motivation
Which type of listing is this?
• A: the same business
• B: different businesses sharing the same phone#
• C: different businesses, only one correctly associated with the given phone#
3
Current Solution
• Uniqueness constraint
– Each real-world entity has a unique value for the attribute, e.g., phone, address
• The data may not satisfy the constraint
– Erroneous values
– Small number of exceptions
• Current two-step solution
– Step 1: Record Linkage
• Link records that are likely to refer to the same real-world entity [A.K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06]
– Step 2: Data Fusion
• Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]
4
Limitations of Current Solution
SOURCE  NAME             PHONE     ADDRESS
s1      Microsofe Corp.  xxx-1255  1 Microsoft Way
s1      Microsofe Corp.  xxx-9400  1 Microsoft Way
s1      Macrosoft Inc.   xxx-0500  2 Sylvan W.
s2      Microsoft Corp.  xxx-1255  1 Microsoft Way
s2      Microsofe Corp.  xxx-9400  1 Microsoft Way
s2      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s3      Microsoft Corp.  xxx-1255  1 Microsoft Way
s3      Microsoft Corp.  xxx-9400  1 Microsoft Way
s3      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s4      Microsoft Corp.  xxx-1255  1 Microsoft Way
s4      Microsoft Corp.  xxx-9400  1 Microsoft Way
s4      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s5      Microsoft Corp.  xxx-1255  1 Microsoft Way
s5      Microsoft Corp.  xxx-9400  1 Microsoft Way
s5      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s6      Microsoft Corp.  xxx-2255  1 Microsoft Way
s6      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s7      MS Corp.         xxx-1255  1 Microsoft Way
s7      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s8      MS Corp.         xxx-1255  1 Microsoft Way
s8      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s9      Macrosoft Inc.   xxx-0500  2 Sylvan Way
s10     MS Corp.         xxx-0500  2 Sylvan Way
Entity 1: names (Microsoft Corp., Microsofe Corp., MS Corp.); phones (xxx-1255, xxx-9400); address (1 Microsoft Way)
Entity 2: name (Macrosoft Inc.); phone (xxx-0500); addresses (2 Sylvan Way, 2 Sylvan W.)
• Erroneous values may prevent correct matching
• Traditional techniques may fall short when exceptions to the uniqueness constraints exist
• Locally resolving conflicts for linked records may overlook important global evidence
5
Our Solution
• Perform linkage and fusion simultaneously
– Incorrect values can be identified from the beginning, which improves linkage
• Make global decisions
– Consider all sources that associate a pair of values in the same record, which improves fusion
• Allow a small number of violations to capture possible exceptions in the real world
6
Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions
7
Problem Input
• A set of independent data sources, each
providing a set of records
• A set of (soft) uniqueness constraints
– Uniqueness constraint (hard constraint):
• Business Name, Business Phone, Business Address
– Soft uniqueness constraint (soft constraint):
• Business Phone, annotated with 1-p1 and 1-p2
8
Problem Output
• Real-world entities
• For each (soft) uniqueness attribute of each entity
– True value (if any)
– Various representations of each true value
Entity 1: names (Microsoft Corp., Microsofe Corp., MS Corp.); phones (xxx-1255, xxx-9400); address (1 Microsoft Way)
Entity 2: name (Macrosoft Inc.); phone (xxx-0500); addresses (2 Sylvan Way, 2 Sylvan W.)
9
K-Partite Graph Encoding
[Figure: k-partite graph over the example records. Name nodes N1-N4, phone nodes P1-P4, and address nodes A1 (1 Microsoft Way), A2 (2 Sylvan Way), A3 (2 Sylvan W.). A sample record from S1 is shown: Microsofe Corp., xxx-1255, 1 Microsoft Way. Edges are labeled with sources: s(1), s(1-2), s(1-5), s(1-5,7,8), s(1-9), s(2-5), s(2-6), s(2-9), s(2-10), s(3-5), s(6), s(7-8), s(10).]
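A minimal Python sketch of how such a graph could be assembled from records. Every distinct value is a node, and an edge between two values of different attributes carries the set of sources whose records contain both. The tuple layout, the name `build_edges`, and the source labels attached to this small excerpt of records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# (source, name, phone, address): a small excerpt of the running example.
# The source labels here are assumed for illustration.
RECORDS = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s10", "MS Corp.", "xxx-0500", "2 Sylvan Way"),
]
ATTRS = ("name", "phone", "address")


def build_edges(records):
    """Map each pair of co-occurring values to the set of supporting sources."""
    edges = defaultdict(set)
    for source, *values in records:
        nodes = list(zip(ATTRS, values))  # e.g. ("name", "MS Corp.")
        for u, v in combinations(nodes, 2):
            edges[(u, v)].add(source)
    return dict(edges)


edges = build_edges(RECORDS)
```

Each key is a pair of (attribute, value) nodes, so values of the same attribute are never connected directly, which is what makes the graph k-partite.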
10
Solution Encoding
[Figure: the solution drawn on the example graph, with name nodes N1-N4, phone nodes P1-P4, and address nodes A1 (1 Microsoft Way), A2 (2 Sylvan Way), A3 (2 Sylvan W.).]
Clustering problem & Matching problem
11
Solution Encoding with Hard Constraint
[Figure: the example graph partitioned into clusters C1-C4 over name nodes N1-N4, phone nodes P1-P4, and address nodes A1 (1 Microsoft Way), A2 (2 Sylvan Way), A3 (2 Sylvan W.).]
Clustering problem
12
Road Map
• Motivation and overview
• Problem definition
• Solution
• Clustering w.r.t. hard constraint
• Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions
13
Clustering w.r.t. Hard Constraints
• Ideal clustering:
[Figure: cluster C1 holds N1, N2, N3, P1 and A1 (1 Microsoft Way); cluster C4 holds N4, P4, A2 (2 Sylvan Way) and A3 (2 Sylvan W.).]
• Objective function
– Davies-Bouldin Index (Minimization)
– high cohesion within each cluster
– low correlation between different clusters
• Distance: average of
– similarity distance
– association distance
Similarity Distance
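• Similarity of values
• Defined for each attribute
[Figure: similarity edges on the example graph. Name similarities are 0.95, 0.65 and 0.65 within C1, and 0.7, 0.7 and 0.4 between the names of C1 and N4. A2 (2 Sylvan Way) and A3 (2 Sylvan W.) have similarity 0.9. Phones and addresses of C1 and C4 have similarity 0.]
Within C1:
$d_S^1(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)
$d_S^2(C_1,C_1) = 0$ (phone)
$d_S^3(C_1,C_1) = 0$ (address)
$d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$
Between C1 and C4:
$d_S^1(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$
$d_S^2(C_1,C_4) = 1 - 0 = 1$
$d_S^3(C_1,C_4) = 1 - 0 = 1$
$d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$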
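The arithmetic on this slide can be checked with a short Python sketch. The function name `similarity_distance` and the convention that an attribute with no value pairs contributes distance 0 (as for the single phone and single address of C1) are assumptions made for illustration.

```python
def similarity_distance(per_attribute_sims):
    """Average, over attributes, of 1 minus the mean pairwise similarity.

    per_attribute_sims holds one list of pairwise similarities per attribute.
    An empty list (nothing to compare) is assumed to mean distance 0.
    """
    dists = []
    for sims in per_attribute_sims:
        dists.append(1 - sum(sims) / len(sims) if sims else 0.0)
    return sum(dists) / len(dists)


# C1 with itself: three name pairs, one phone, one address
d_c1_c1 = similarity_distance([[0.95, 0.65, 0.65], [], []])
# C1 versus C4: name pairs 0.7, 0.7, 0.4; phone and address similarity 0
d_c1_c4 = similarity_distance([[0.7, 0.7, 0.4], [0.0], [0.0]])

print(round(d_c1_c1, 3), round(d_c1_c4, 2))
```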
15
Association Distance
• Association edges
• Defined for each pair of attributes
[Figure: the example graph with source-labeled edges among N1-N4, P1, P4, A1 (1 Microsoft Way), A2 (2 Sylvan Way) and A3 (2 Sylvan W.).]
Within C1:
• 9 sources (S1-S8, S10) mention (N1,N2,N3,P1)
• 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1
$d_A^{1,2}(C_1,C_1) = 1 - 7/9 = 0.22$
$d_A^{1,3}(C_1,C_1) = 1 - 8/9 = 0.11$
$d_A^{2,3}(C_1,C_1) = 1 - 7/8 = 0.125$
$d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$
Between C1 and C4:
• 10 sources (S1-S10) mention (N1,N2,N3,N4) (P1,P4)
• 1 source (S10) supports (N1,N2,N3)-P4
• No connection between (N4,P1)
$d_A^{1,2}(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$
$d_A^{1,3}(C_1,C_4) = 0.9$
$d_A^{2,3}(C_1,C_4) = 1$
$d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$
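A small Python sketch that reproduces these numbers. For each pair of attributes it takes one or more (supporting, mentioning) source counts and uses 1 minus the best support ratio. The function name `association_distance` is hypothetical, and the counts given for the name-address and phone-address pairs between C1 and C4 are assumed values chosen to match the distances on the slide.

```python
def association_distance(pair_counts):
    """Average, over attribute pairs, of 1 - max(supporting / mentioning).

    pair_counts holds, for each attribute pair, a list of
    (supporting_sources, mentioning_sources) tuples.
    """
    dists = [1 - max(s / m for s, m in counts) for counts in pair_counts]
    return sum(dists) / len(dists)


# Within C1: name-phone 7 of 9, name-address 8 of 9, phone-address 7 of 8
d_c1_c1 = association_distance([[(7, 9)], [(8, 9)], [(7, 8)]])

# Between C1 and C4: name-phone counts are from the slide;
# the other two rows are assumed so that they yield 0.9 and 1.
d_c1_c4 = association_distance([
    [(1, 10), (0, 10)],
    [(1, 10), (0, 10)],
    [(0, 10), (0, 10)],
])

print(round(d_c1_c1, 3), round(d_c1_c4, 2))
```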
16
Greedy Algorithm
• Obtaining the optimal clustering is intractable
– [T.F. Gonzales, 82], [J. Simal et al., 06]
• Hill-climbing approximation: CLUSTER
– Step 1: Initialization
• Cluster value representations by their similarity, then associate the clusters by majority voting
– Step 2: Adjustment
• Move each node to the cluster that minimizes the DB index
– Step 3: Convergence checking
• Terminate if Step 2 does not change the clustering result; otherwise, repeat Step 2
• The algorithm converges
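The adjustment loop can be sketched in Python as below. This is only an outline of the hill-climbing idea: the objective is passed in as a function, and the toy objective used here (within-cluster squared spread of one-dimensional points) is a stand-in, not the Davies-Bouldin index of the slides. All names are hypothetical.

```python
def hill_climb(assignment, cluster_ids, objective):
    """Move one node at a time to the cluster that lowers the objective
    the most; stop when a full pass changes nothing."""
    assignment = dict(assignment)
    improved = True
    while improved:
        improved = False
        for node in assignment:
            best_cluster = assignment[node]
            best_score = objective(assignment)
            for cid in cluster_ids:
                score = objective({**assignment, node: cid})
                if score < best_score:  # strict decrease guarantees termination
                    best_cluster, best_score = cid, score
            if best_cluster != assignment[node]:
                assignment[node] = best_cluster
                improved = True
    return assignment


def within_cluster_spread(points):
    """Toy objective: sum of squared deviations from each cluster mean."""
    def objective(assignment):
        total = 0.0
        for cid in set(assignment.values()):
            members = [points[n] for n, c in assignment.items() if c == cid]
            mean = sum(members) / len(members)
            total += sum((x - mean) ** 2 for x in members)
        return total
    return objective


points = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2}
start = {"a": 0, "b": 1, "c": 1, "d": 1}  # a deliberately poor initialization
result = hill_climb(start, [0, 1], within_cluster_spread(points))
```

Because a node only moves when the objective strictly decreases and there are finitely many assignments, the loop must stop, which mirrors the convergence claim above.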
17
[Figure: CLUSTER on the example graph, with nodes N1-N4, P1-P4, A1 (1 Microsoft Way), A2 (2 Sylvan Way), A3 (2 Sylvan W.) and clusters C1-C4. Index values shown: Φ=0.94, Φ=0.93, Φ=1.16, Φ=0.89, Φ=0.71, Φ=0.45.]
18
Road Map
• Motivation and overview
• Problem definition
• Solution
• Clustering w.r.t. hard constraint
• Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions
19
Matching w.r.t. Soft Constraints
GRAPH TRANSFORM
[Figure: each cluster of the example graph becomes a single node. NC1 = (Microsoft Corp., Microsofe Corp., MS Corp.), NC4 = (Macrosoft Inc.), phone nodes PC1-PC4, AC1 = (1 Microsoft Way), AC4 = (2 Sylvan Way, 2 Sylvan W.). Edge weights count the supporting sources: 7 for s(1-5,7,8), 5 for s(1-5), 1 for s(6), 1 for s(10), 9 for s(1-9), 8 for s(1-8).]
• Next? Matching problem
• How to match?
20
Matching w.r.t. Soft Constraint
• Intuitions
– Largest sum of weights
– Smallest gap
– How to balance these two goals?
[Figure: three candidate matchings for a name node N with edges to P1 (weight 1, s1), P2 (weight 9, s2-s10) and P3 (weight 10, s1-s10). Solution 1: Gap(N) = 1. Solution 2: Gap(N) = 9. Solution 3: Gap(N) = 0.]
• Optimization problem
– Maximize $\sum_{(u,v) \in M} \frac{w(u,v)}{Gap(u) + Gap(v) + \alpha}$
– Subject to $\frac{|\hat{A}_K|}{|A_K|} \le p_1$ and $\frac{|\hat{A}|}{|A|} \le p_2$
• Two-phase greedy algorithm: MATCH
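To make the trade-off between total weight and gap concrete, here is a Python sketch that scores three candidate matchings for N. It rests on several assumptions: Gap of a node is taken as the spread between the largest and smallest weights of its matched edges, each edge weight is divided by the gaps of its two endpoints plus a constant, the constant `ALPHA` is set to 1, and the edge sets chosen for each solution are picked to match the Gap(N) values on the slide.

```python
from collections import defaultdict

ALPHA = 1.0  # assumed constant that keeps the denominator positive


def gap(weights):
    """Assumed definition: spread between largest and smallest matched weight."""
    return max(weights) - min(weights)


def score(matching, alpha=ALPHA):
    """matching is a list of (u, v, weight) edges selected for the solution."""
    per_node = defaultdict(list)
    for u, v, w in matching:
        per_node[u].append(w)
        per_node[v].append(w)
    return sum(
        w / (gap(per_node[u]) + gap(per_node[v]) + alpha)
        for u, v, w in matching
    )


solution_1 = [("N", "P2", 9), ("N", "P3", 10)]  # Gap(N) = 1
solution_2 = [("N", "P1", 1), ("N", "P3", 10)]  # Gap(N) = 9
solution_3 = [("N", "P3", 10)]                  # Gap(N) = 0

scores = [score(s) for s in (solution_1, solution_2, solution_3)]
```

Under these assumptions, matching N to a weakly supported phone (Solution 2) is penalized heavily by the large gap, while two similarly supported phones (Solution 1) score almost as well as the single best phone (Solution 3).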
21
Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions
22
Experiment Settings
• Dataset I
– Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources

Business and Source
Zip    Business  #Sources  #Srcs/business
07035  662       15        1-7
07715  149       6         1-3

Records
Zip    #Recs  #Names  #Phones  #Addresses  #(Err Ps)
07035  1629   1154    839      735         72
07715  266    243     184      55          12

Constraint Violation
Zip    NP       PN        NA       AN
07035  8%(2.6)  .8%(2.7)  2%(2.3)  12.6%(5.1)
07715  4%(2)    1%(3)     4%(2)    4%(8.5)
23
Experiment Settings
• Implementation
– MATCH (invoking CLUSTER first)
– LINK: record linkage only
– FUSE: data fusion only
– LINKFUSE: first LINK, then FUSE
• Golden Standard: by manual checking
• Measures: Precision/Recall/F-measure

Matching of values of different attributes
Precision  $P = \frac{|G_M \cap R_M|}{|R_M|}$
Recall     $R = \frac{|G_M \cap R_M|}{|G_M|}$
F-measure  $F = \frac{2PR}{P+R}$

Clustering of values of the same attribute
Precision  $P = \frac{|G_A \cap R_A|}{|R_A|}$
Recall     $R = \frac{|G_A \cap R_A|}{|G_A|}$
F-measure  $F = \frac{2PR}{P+R}$

Notation  Description
$G_M$     Matched pairs for the golden standard
$R_M$     Matched pairs for our results
$G_A$     Clustered pairs for the golden standard
$R_A$     Clustered pairs for our results
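The three measures are the usual set-based ones, so a short Python sketch is enough to show how they would be computed from two sets of pairs. The function name `prf` and the example pairs are hypothetical.

```python
def prf(gold_pairs, result_pairs):
    """Precision, recall and F-measure of result pairs against gold pairs."""
    gold, result = set(gold_pairs), set(result_pairs)
    hits = len(gold & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Illustrative name-phone pairs only; not taken from the experiments.
gold = {("N1", "P1"), ("N1", "P2"), ("N4", "P4")}
ours = {("N1", "P1"), ("N4", "P4"), ("N1", "P4")}
p, r, f = prf(gold, ours)
```

The same function covers both cases: pass matched pairs of values from different attributes for matching quality, or pairs of values placed in the same cluster for clustering quality.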
24
Accuracy
• MATCH achieves highest F-measure in most cases
– Improves LINK by 11% on name-phone matching, by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
– LINK: high recall in matching
– FUSE: high precision in matching, high precision in name clustering
– LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
[Figure: six result panels. 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME), 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME).]
25
Efficiency and Scalability
• Data set II
– Entire listing: 40+M records
• Hadoop-based linkage framework
– Fuzzy self-join using Hadoop
– Partition records into strongly connected components

median  95th percentile  99th percentile  max
2       5                7                2103

• Efficiency
– Linear growth
– Execution time

Module               Execution time (hour)
Record extraction    0.002
Fuzzy self join      0.89
Connected component  0.89
Linkage              1.36
Overall              3.26
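The partitioning step can be illustrated on a single machine with a union-find structure. This Python sketch is not the Hadoop implementation from the slides; it only shows how similar pairs produced by a self-join could be grouped into components so that each component is linked independently. The function name and the sample input are hypothetical.

```python
def connected_components(record_ids, similar_pairs):
    """Group records into components given pairs found similar by a join."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


components = connected_components([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (4, 5)])
```

Since most components are small (median 2 in the table above), running linkage per component keeps each unit of work small.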
26
Conclusions
• In the real world, we need to resolve duplicates and conflicts at the same time.
• We reduce the problem to a k-partite graph clustering and matching problem
– Combine linkage and fusion
– Apply them in a global fashion
• Experiments show high accuracy and scalability
27
Thank You!
28