Linking Records with Erroneous Values
Songtao Guo, Xin Luna Dong, Divesh Srivastava, and Remi Zajac
AT&T Labs

Motivation

[Figure: listings from sources V, D, A and T; labels: Cleaned Data, Search Box]

Src | Name | Phone | Address | City
V | A-Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
V | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
V | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
V | Aeropostale | 8185458972 | 2187 GLENDALE GALLERIA | GLENDALE
V | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
V | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | NEWTOWN
V | Pizza Palace Of Newtown | 2034266114 | 65 Church hill Rd | NEWTOWN

Src | Name | Phone | Address | City
D | Aerosoles | 8182462455 | 1163 GLENDALE GALLERIA | GLENDALE
D | Aldo Shoes | 8184090612 | 1157 GLENDALE GALLERIA | GLENDALE
D | Newtown Pizza Palace | 2034266114 | Church Hill Rd | Newtown
D | Pizza Palace of Newtown | 2034266114 | 65 Church hill Rd | Newtown

Src | Name | Phone | Address | City
A | A 24 Hour 1 A 1 Locksmith | 8182404644 | 3210 GLENDALE GALLERIA | GLENDALE
A | A Link Wireless | 8185491449 | 2148 GLENDALE GALLERIA | GLENDALE
A | Abercrombie | 8185020728 | 2229 GLENDALE GALLERIA | GLENDALE
A | Abercrombie & Fitch | 8185507492 | 2151 GLENDALE GALLERIA | GLENDALE
A | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
A | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
A | Alert Cellular | 8182404779 | 2148 GLENDALE GALLERIA | GLENDALE

Src | Name | Phone | Address | City
T | Newtown Pizza Palace | 2034266114 | 65 Church hill Rd | Newtown
T | Aldo Shoes | 8185482540 | 2154 GLENDALE GALLERIA | GLENDALE
T | American Eagle Outfitters | 8189561893 | 2182 GLENDALE GALLERIA | GLENDALE
T | ANN TAYLOR | 8182460350 | 2178 GLENDALE GALLERIA | GLENDALE
T | Ann Taylor Stores | 8182460350 | 1108 GLENDALE GALLERIA | GLENDALE

Which type of listing are they?
• A: the same business
• B: different businesses sharing the same phone#
• C: different businesses, only one correctly associated with the given phone#

Current Solution
• Uniqueness constraint
  – Each real-world entity has a unique value.
    E.g., phone, address
• The data may not satisfy the constraint
  – Erroneous values
  – Small number of exceptions
• Current two-step solution
  – Step 1: Record Linkage
    • Link records that are likely to refer to the same real-world entity [A. K. Elmagarmid, TKDE’07], [W. Winkler, Tech Report’06]
  – Step 2: Data Fusion
    • Decide the correct values in the presence of conflicts [J. Bleiholder et al., ACM Computing Surveys]

Limitations of Current Solution

SOURCE | NAME | PHONE | ADDRESS
s1 | Microsofe Corp. | xxx-1255 | 1 Microsoft Way
s1 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s1 | Macrosoft Inc. | xxx-0500 | 2 Sylvan W.
s2 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s2 | Microsofe Corp. | xxx-9400 | 1 Microsoft Way
s2 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s3 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s3 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s3 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s4 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s4 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s4 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s5 | Microsoft Corp. | xxx-1255 | 1 Microsoft Way
s5 | Microsoft Corp. | xxx-9400 | 1 Microsoft Way
s5 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s6 | Microsoft Corp. | xxx-2255 | 1 Microsoft Way
s6 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s7 | MS Corp. | xxx-1255 | 1 Microsoft Way
s7 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s8 | MS Corp. | xxx-1255 | 1 Microsoft Way
s8 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s9 | Macrosoft Inc. | xxx-0500 | 2 Sylvan Way
s10 | MS Corp. | xxx-0500 | 2 Sylvan Way

(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)
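To make the two-step pipeline concrete, here is a minimal sketch run on the records of the table above. It is not the method of any cited system: Step 1 is replaced by a deliberately naive linkage rule (records link only if their phone strings are identical), and Step 2 by majority voting inside each linked group. The function names `link_by_phone`, `fuse` and `link_then_fuse` are hypothetical, and the assignment of rows to sources s1-s10 is a reading of the table.

```python
from collections import Counter, defaultdict

# (source, name, phone, address) rows, as read from the table above.
RECORDS = [
    ("s1", "Microsofe Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s1", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s1", "Macrosoft Inc.", "xxx-0500", "2 Sylvan W."),
    ("s2", "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s2", "Microsofe Corp.", "xxx-9400", "1 Microsoft Way"),
    ("s2", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
]
for src in ("s3", "s4", "s5"):
    RECORDS += [
        (src, "Microsoft Corp.", "xxx-1255", "1 Microsoft Way"),
        (src, "Microsoft Corp.", "xxx-9400", "1 Microsoft Way"),
        (src, "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ]
RECORDS += [
    ("s6", "Microsoft Corp.", "xxx-2255", "1 Microsoft Way"),
    ("s6", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s7", "MS Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s7", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s8", "MS Corp.", "xxx-1255", "1 Microsoft Way"),
    ("s8", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s9", "Macrosoft Inc.", "xxx-0500", "2 Sylvan Way"),
    ("s10", "MS Corp.", "xxx-0500", "2 Sylvan Way"),
]


def link_by_phone(records):
    """Step 1 (assumed rule): link records whose phone strings are identical."""
    groups = defaultdict(list)
    for record in records:
        groups[record[2]].append(record)
    return dict(groups)


def fuse(group):
    """Step 2 (assumed rule): majority vote on name and address within a group."""
    name = Counter(r[1] for r in group).most_common(1)[0][0]
    address = Counter(r[3] for r in group).most_common(1)[0][0]
    return name, address


def link_then_fuse(records):
    """Run linkage first, then resolve conflicts locally in each linked group."""
    return {phone: fuse(group) for phone, group in link_by_phone(records).items()}


entities = link_then_fuse(RECORDS)
for phone, (name, address) in sorted(entities.items()):
    print(phone, name, address, sep=" | ")
```

Under these assumptions the sketch returns four phone-keyed entities, whereas the grouping shown above has two: the single record with xxx-2255 becomes an entity of its own, xxx-1255 and xxx-9400 are never merged, and the one MS Corp. record carrying xxx-0500 is silently outvoted inside the Macrosoft Inc. group.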
• Erroneous values may prevent correct matching
• Traditional techniques may fall short when exceptions to the uniqueness constraints exist
• Locally resolving conflicts for linked records may overlook important global evidence

Our Solution
• Perform linkage and fusion simultaneously
  – Incorrect values can be identified from the beginning, which improves linkage
• Make global decisions
  – Consider the sources that associate a pair of values in the same record, which improves fusion
• Allow a small number of violations to capture possible exceptions in the real world

Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions

Problem Input
• A set of independent data sources, each providing a set of records
• A set of (soft) uniqueness constraints
  – Uniqueness constraint (hard constraint):
    • Business Name, Business Phone, Business Address
  – Soft uniqueness constraint (soft constraint):
    • Business Phone, with labels $1-p_1$ and $1-p_2$

Problem Output
• Real-world entities
• For each (soft) uniqueness attribute of each entity
  – True value (if any)
  – Various representations of each true value

(Microsoft Corp., Microsofe Corp., MS Corp.) (xxx-1255, xxx-9400) (1 Microsoft Way)
(Macrosoft Inc.) (xxx-0500) (2 Sylvan Way, 2 Sylvan W.)

K-Partite Graph Encoding
[Figure: k-partite graph with name nodes N1-N4, phone nodes P1-P4 and address nodes A1-A3 (A1 = 1 Microsoft Way, A2 = 2 Sylvan Way, A3 = 2 Sylvan W.). Each edge is labeled with the sources that provide its two values in the same record, e.g. s(1-2), s(2-5), s(1-5,7,8), s(10).]

Solution Encoding
[Figure: the same graph over N1-N4, P1-P4 and A1-A3.]
Clustering problem & Matching problem

Solution Encoding with Hard Constraint
[Figure: the nodes N1-N4, P1-P4 and A1-A3 grouped into clusters C1-C4.]
Clustering problem

Road Map
• Motivation and overview
• Problem definition
• Solution
  • Clustering w.r.t. hard constraint
  • Matching w.r.t.
    soft constraint
• Evaluations on YP data
• Conclusions

Clustering w.r.t. Hard Constraints
• Ideal clustering:
  [Figure: C1 = {N1, N2, N3, P1, A1 (1 Microsoft Way)}; C4 = {N4, P4, A2 (2 Sylvan Way), A3 (2 Sylvan W.)}]
• Objective function
  – Davies-Bouldin index (minimization)
  – High cohesion within each cluster
  – Low correlation between different clusters
• Average distance of
  – similarity distance
  – association distance

Similarity Distance
• Similarity of values
• Defined for each attribute
[Figure: pairwise value similarities. Names within C1: 0.95, 0.65, 0.65; names of C1 vs. N4: 0.7, 0.7, 0.4; P1 vs. P4: 0; A1 vs. A2 and A3: 0; A2 vs. A3: 0.9.]

Within C1:
  $d^1_S(C_1,C_1) = 1 - (0.95+0.65+0.65)/3 = 0.25$ (name)
  $d^2_S(C_1,C_1) = 0$ (phone)
  $d^3_S(C_1,C_1) = 0$ (address)
  $d_S(C_1,C_1) = (0.25+0+0)/3 = 0.083$
Between C1 and C4:
  $d^1_S(C_1,C_4) = 1 - (0.7+0.7+0.4)/3 = 0.4$
  $d^2_S(C_1,C_4) = 1 - 0 = 1$
  $d^3_S(C_1,C_4) = 1 - 0 = 1$
  $d_S(C_1,C_4) = (0.4+1+1)/3 = 0.8$

Association Distance
• Association by edges
• Defined for each pair of attributes
• Within C1: 9 sources (S1-S8, S10) mention (N1,N2,N3,P1); 7 sources (S1-S5, S7, S8) support (N1,N2,N3)-P1
• Between C1 and C4: 10 sources (S1-S10) mention (N1,N2,N3,N4) and (P1,P4); 1 source (S10) supports (N1,N2,N3)-P4; no connection between (N4,P1)
[Figure: the k-partite graph with its source-labeled edges, clusters C1 and C4 marked.]

Within C1:
  $d^{1,2}_A(C_1,C_1) = 1 - 7/9 = 0.22$
  $d^{1,3}_A(C_1,C_1) = 1 - 8/9 = 0.11$
  $d^{2,3}_A(C_1,C_1) = 1 - 7/8 = 0.125$
  $d_A(C_1,C_1) = (0.22+0.11+0.125)/3 = 0.153$
Between C1 and C4:
  $d^{1,2}_A(C_1,C_4) = 1 - \max(1/10, 0/10) = 0.9$
  $d^{1,3}_A(C_1,C_4) = 0.9$
  $d^{2,3}_A(C_1,C_4) = 1$
  $d_A(C_1,C_4) = (0.9+0.9+1)/3 = 0.93$

Greedy Algorithm
• Obtaining the optimal clustering is intractable
  – [T. F. Gonzalez, 82], [J. Šíma et al., 06]
• Hill climbing approximation: CLUSTER
  – Step 1: Initialization
    • Cluster value representations by their similarity. Do majority voting to associate clusters
  – Step 2: Adjustment
    • For each node, move it to the cluster that minimizes the DB index
  – Step 3: Convergence checking
    • Terminate if Step 2 doesn’t change the clustering result.
      Otherwise, repeat Step 2
• The algorithm converges

[Figure: the example graph (nodes N1-N4, P1-P4, A1-A3; clusters C1-C4), annotated with Φ=0.94, Φ=0.93, Φ=1.16, Φ=0.89, Φ=0.71, Φ=0.45.]

Road Map
• Motivation and overview
• Problem definition
• Solution
  • Clustering w.r.t. hard constraint
  • Matching w.r.t. soft constraint
• Evaluations on YP data
• Conclusions

Matching w.r.t. Soft Constraints
[Figure: GRAPH TRANSFORM. Value nodes are collapsed into cluster nodes NC1 (MS Corp., Microsoft Corp., Microsofe Corp.), NC4 (Macrosoft Inc.), PC1-PC4, AC1 (1 Microsoft Way) and AC4 (2 Sylvan Way, 2 Sylvan W.). Each edge carries the number of supporting sources, e.g. 7 for s(1-5,7,8), 9 for s(1-9), 8 for s(1-8), 1 for s(10).]
• Next? Matching problem
• How to match?

Matching w.r.t. Soft Constraint
• Intuitions
  – Largest sum of weights
  – Smallest gap
  – How to balance these two goals?
[Figure: node N with edges to P1 (weight 1, s1), P2 (weight 9, s2-s10) and P3 (weight 10, s1-s10). Solution 1: Gap(N) = 1; Solution 2: Gap(N) = 9; Solution 3: Gap(N) = 0.]
• Optimization problem
  – Maximize $\sum_{(u,v) \in M} \frac{w(u,v)}{Gap(u) + Gap(v)}$
  – Subject to $0 \le \frac{|\hat{A}_K|}{|A_K|} \le p_1$ and $0 \le \frac{|\hat{A}|}{|A|} \le p_2$
• Two-phase greedy algorithm: MATCH

Road Map
• Motivation and overview
• Problem definition
• Solution
• Evaluations on YP data
• Conclusions

Experiment Settings
• Dataset I
  – Business listings for two zip codes (07035, Lincoln Park, NJ; 07715, Belmar, NJ) from multiple sources

Zip | #Businesses | #Sources | #Srcs/business
07035 | 662 | 15 | 1-7
07715 | 149 | 6 | 1-3

Records:
Zip | #Recs | #Names | #Phones | #Addresses | #(Err Ps)
07035 | 1629 | 1154 | 839 | 735 | 72
07715 | 266 | 243 | 184 | 55 | 12

Constraint violation:
Zip | NP | PN | NA | AN
07035 | 8%(2.6) | .8%(2.7) | 2%(2.3) | 12.6%(5.1)
07715 | 4%(2) | 1%(3) | 4%(2) | 4%(8.5)

Experiment Settings
• Implementation
  – MATCH (invoking CLUSTER first)
  – LINK: record linkage only
  – FUSE: data fusion only
  – LINKFUSE: first LINK, then FUSE
• Gold standard: by manual checking
• Measures: Precision/Recall/F-measure
Matching of values
of different attributes:
  – Precision: $P = \frac{|G_M \cap R_M|}{|R_M|}$
  – Recall: $R = \frac{|G_M \cap R_M|}{|G_M|}$
  – F-measure: $F = \frac{2PR}{P+R}$
Clustering of values of the same attribute:
  – Precision: $P = \frac{|G_A \cap R_A|}{|R_A|}$
  – Recall: $R = \frac{|G_A \cap R_A|}{|G_A|}$
  – F-measure: $F = \frac{2PR}{P+R}$

Notation | Description
$G_M$ | Matched pairs for the gold standard
$R_M$ | Matched pairs for our results
$G_A$ | Clustered pairs for the gold standard
$R_A$ | Clustered pairs for our results

Accuracy
• MATCH achieves the highest F-measure in most cases
  – Improves on LINK by 11% on name-phone matching and by 20% on name clustering
• LINK vs. FUSE vs. LINKFUSE
  – LINK: high recall in matching
  – FUSE: high precision in matching, high precision in name clustering
  – LINKFUSE: only slightly better than FUSE in matching and similar to LINK in clustering
[Figures: 07035 Matching (NAME-PHONE), 07035 Matching (NAME-ADDRESS), 07035 Clustering (NAME); 07715 Matching (NAME-PHONE), 07715 Matching (NAME-ADDRESS), 07715 Clustering (NAME)]

Efficiency and Scalability
• Data set II
  – Entire listing: 40+M records
• Hadoop-based linkage framework
  – Fuzzy self-join using Hadoop
  – Partition records into strongly connected components (median 2, 95th percentile 5, 99th percentile 7, max 2103)
• Efficiency
  – Linear growth
  – Execution time

Module | Execution time (hour)
Record extraction | 0.002
Fuzzy self join | 0.89
Connected component | 0.89
Linkage | 1.36
Overall | 3.26

Conclusions
• In the real world, we need to resolve duplicates and conflicts at the same time.
• We reduce the problem to a k-partite graph clustering and matching problem
  – Combine linkage and fusion
  – Apply them in a global fashion
• Experiments show high accuracy and scalability

Thank You!
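As a small illustration of the precision, recall and F-measure definitions from the Experiment Settings slide, the sketch below computes all three from two sets of pairs. The function name `precision_recall_f` and the example pairs are hypothetical and are not taken from the experiments; the same function applies to clustered pairs by passing $G_A$ and $R_A$ instead of $G_M$ and $R_M$.

```python
def precision_recall_f(gold_pairs, result_pairs):
    """Return (P, R, F) for a result pair set measured against a gold pair set.

    P = |G ∩ R| / |R|, R = |G ∩ R| / |G|, F = 2PR / (P + R).
    Empty sets and P + R = 0 yield 0.0 instead of raising (an assumed convention).
    """
    gold, result = set(gold_pairs), set(result_pairs)
    hits = len(gold & result)
    precision = hits / len(result) if result else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical name-cluster/phone matches, loosely modeled on the running example.
gold_pairs = {("NC1", "xxx-1255"), ("NC1", "xxx-9400"), ("NC4", "xxx-0500")}
result_pairs = {("NC1", "xxx-1255"), ("NC1", "xxx-2255"), ("NC4", "xxx-0500")}

p, r, f = precision_recall_f(gold_pairs, result_pairs)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Treating pairs as set elements keeps the intersection in the numerator exact, so a result that misses one true match and adds one spurious match is penalized in both precision and recall.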