An Unsupervised Algorithm for Learning Blocking Schemes
Mayank Kejriwal and Daniel P. Miranker, University of Texas at Austin

Unsupervised Deduplication: a 40-year-old problem, still open!
Dirty Data → Deduplication → Clean Data, with no user input!

Deduplication: what is it?
ID | Name | Address | City | Cuisine
1 | Fenix | 8358 Sunset Blvd. | West Hollywood | American
2 | Art's Delicatessen | 12224 Ventura Blvd. | Studio City | American
3 | Hotel Bel-Air | 701 Stone Canyon Rd. | Bel Air | Californian
4 | Art's Deli | 12224 Ventura Blvd. | Studio City | Delis
5 | Fenix at the Argyle | 8358 Sunset Blvd. | W. Hollywood | French (new)
Only 3 'entities' are represented by these 5 records! Other names and variants: Record Linkage, Entity Resolution…

Suppose we have a boolean similarity function Sim(r,s) that returns True iff Entity(r) = Entity(s). Do we compare every record to every other?

What if we group first on the Name attribute?
ID | Name | Address | City | Cuisine
4 | Art's Deli | 12224 Ventura Blvd. | Studio City | Delis
2 | Art's Delicatessen | 12224 Ventura Blvd. | Studio City | American
1 | Fenix | 8358 Sunset Blvd. | West Hollywood | American
5 | Fenix at the Argyle | 8358 Sunset Blvd. | W. Hollywood | French (new)
3 | Hotel Bel-Air | 701 Stone Canyon Rd. | Bel Air | Californian
Only compare records within a group! Each such group constitutes a block.

Blocking
• Key idea: not every pair of records needs comparing
• Effective grouping requires a good blocking scheme (e.g. Name)

Unsupervised Deduplication: a 40-year-old problem, still open!
Dirty Data → [Deduplication: Blocking → Identify Duplicates] → Clean Data

Unsupervised Deduplication: current state of the art
Dirty Data → [Blocking (supervised) → Identify Duplicates (unsupervised)] → Clean Data

Unsupervised Deduplication: our contribution (Kejriwal and Miranker)
Dirty Data → [Blocking (unsupervised) → Identify Duplicates (unsupervised)] → Clean Data

Two key questions
• How do we construct the space of blocking schemes?
• How do we efficiently learn a good blocking scheme from that space?

Constructing the space of blocking schemes
• Given p General Indexing Functions (GIFs), e.g. Tokens("Fenix at the Argyle") = {"Fenix", "at", "the", "Argyle"}
• Construct mp Specific Indexing Functions (SIFs) by binding each GIF to each of the m attributes, e.g. Tokens.Name(r5) = {"Fenix", "at", "the", "Argyle"}
• Construct mp Specific Blocking Predicates (SBPs), e.g. CommonToken.Name(r1,r5) = True
Running example (records r1 and r5 from the table above):
1 | Fenix | 8358 Sunset Blvd. | West Hollywood | American
5 | Fenix at the Argyle | 8358 Sunset Blvd. | W. Hollywood | French (new)
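The GIF → SIF → SBP construction can be pictured with a short sketch. This is a minimal Python illustration, not the paper's code: the function names (tokens, integers, make_sif, make_sbp) and the dictionary-based records are assumptions made for the example.

```python
# Minimal sketch of GIFs, SIFs and SBPs (illustrative names, not the paper's API).

def tokens(value):
    """General Indexing Function (GIF): maps a field value to a set of blocking keys."""
    return set(value.split())

def integers(value):
    """Another GIF: keeps only the integer tokens of a field value."""
    return {tok for tok in value.split() if tok.isdigit()}

def make_sif(gif, attribute):
    """Specific Indexing Function (SIF): a GIF bound to one attribute of a record."""
    return lambda record: gif(record[attribute])

def make_sbp(sif):
    """Specific Blocking Predicate (SBP): True iff two records share a key under the SIF."""
    return lambda r, s: bool(sif(r) & sif(s))

r1 = {"Name": "Fenix", "Address": "8358 Sunset Blvd.",
      "City": "West Hollywood", "Cuisine": "American"}
r5 = {"Name": "Fenix at the Argyle", "Address": "8358 Sunset Blvd.",
      "City": "W. Hollywood", "Cuisine": "French (new)"}

# With p GIFs and m attributes we get mp SIFs, and one SBP per SIF.
common_token_name = make_sbp(make_sif(tokens, "Name"))
common_integer_address = make_sbp(make_sif(integers, "Address"))
print(common_token_name(r1, r5))       # True  -- both names contain the token "Fenix"
print(common_integer_address(r1, r5))  # True  -- both addresses contain "8358"
```

The DNF blocking schemes on the next slides are then just boolean combinations of such predicates.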
Constructing the space of blocking schemes (continued)
• From the mp SBPs, construct the space of $2^{2^{mp}}$ DNF blocking schemes

DNF blocking schemes
Example 1: (ExactMatch.City AND CommonInteger.Address) OR ExactMatch.Cuisine → False for (r1, r5)
Example 2: CommonToken.Name AND CommonInteger.Address → True for (r1, r5)

Learning a scheme
• From the space of $2^{2^{mp}}$ DNF blocking schemes, efficiently learn one scheme

Unsupervised learning algorithm
Tabular dataset → Generate pseudo training set → Perform feature selection → DNF blocking scheme

Generate pseudo training set:

Step 1: Extract term and record frequency statistics
For record r1: Fenix 1, 8358 1, Sunset 1, and so on…
For record r5: Fenix 1, at 1, the 1, and so on…
Record frequency (across the dataset): Fenix 2, 8358 2, at 1, and so on…

Step 2: Block each record on its tokens
• Treat each record like a bag of tokens
• Each token represents a block the record is placed in
Record r1 has tokens: Fenix, 8358, Sunset, Blvd., West, Hollywood, American
Record r5 has tokens: Fenix, at, the, Argyle, 8358, Sunset, Blvd., W., Hollywood, French, new

Step 3: Slide a window of size c over each block
Consider block B with |B| = 8 and c = 3 (records 1 2 3 4 5 6 7 8):
• Generate symmetric pairs inside the window: (r1,r2), (r2,r3), (r1,r3)
• Compute the log TF-IDF score of each pair: score(r1,r2), score(r2,r3), score(r1,r3)
• Apply thresholds to decide whether a pair is a (pseudo) duplicate or non-duplicate:
  – score(r1,r2) > ut? No. score(r1,r2) < lt? Yes → assign to non-duplicate set
  – score(r2,r3) > ut? No. score(r2,r3) < lt? No → do nothing
  – score(r1,r3) > ut? Yes → assign to duplicate set
• Slide the window by one record: generate the new pairs (r2,r4), (r3,r4), compute scores, apply thresholds…
• …repeat all the way to the end of the block
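To make Step 3 concrete, here is a hedged sketch of the windowing pass, assuming blocks are lists of record IDs and that score() stands in for the log TF-IDF similarity; the exact scoring and bookkeeping in the paper may differ.

```python
# Sketch of Step 3: slide a window of size c over one block and threshold pair scores
# into pseudo-duplicates / pseudo-non-duplicates. Names and defaults are illustrative.

def generate_pseudo_sets(block, score, c, ut, lt):
    """Return (pseudo_duplicates, pseudo_non_duplicates) for one block."""
    dups, non_dups, seen = [], [], set()
    for start in range(max(1, len(block) - c + 1)):
        window = block[start:start + c]
        for i in range(len(window)):
            for j in range(i + 1, len(window)):
                pair = (window[i], window[j])
                if pair in seen:          # already generated by an earlier window position
                    continue
                seen.add(pair)
                s = score(*pair)
                if s > ut:                # confidently similar -> pseudo-duplicate
                    dups.append(pair)
                elif s < lt:              # confidently dissimilar -> pseudo-non-duplicate
                    non_dups.append(pair)
                # lt <= s <= ut: do nothing; the ambiguous pair is dropped
    return dups, non_dups

# For block = [1, 2, 3, 4, 5, 6, 7, 8] and c = 3, the first window yields the pairs
# (1,2), (1,3), (2,3); sliding by one record then adds (2,4) and (3,4), and so on.
```

Because only pairs not seen in an earlier window position are scored, the number of scored pairs grows linearly with the block size for a fixed window size c.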
Why windowing?
• Without windowing, generating the pseudo training set could take $O(n^2)$ in the worst case
• Windowing guarantees $O(n)$ (details in paper)

Unsupervised learning algorithm
Tabular dataset → Generate pseudo training set → Perform feature selection → DNF blocking scheme

Perform feature selection:

Step 1: Extract binary features for each pair in the pseudo training set
For the pair (r1, r5):
• ExactMatch.Name? False → 0
• CommonToken.Name? True → 1
• …
• CommonToken.Cuisine? False → 0
A binary feature vector [0, 1, …, 0] is obtained for the pair; the vector size equals the number of specific blocking predicates (= mp).

Step 2: Collect feature vectors for the pseudo training set
Pseudo duplicates: [0, 1, …, 0], [1, 0, …, 0], [1, 1, …, 1], …, [0, 0, …, 1]
Pseudo non-duplicates: [1, 1, …, 0], [0, 0, …, 1], [1, 0, …, 1], …, [0, 0, …, 0]

Step 3 (ideal): Choose a minimum feature subset such that…
• The chosen features leave at most a small, bounded number of duplicates uncovered
• A minimum number of non-duplicates is covered
• Finding the optimal solution is NP-hard

Step 3 (reframed): Approximate the solution by…
• Using a greedy algorithm + Fisher discrimination
• Greedy algorithm: iteratively pick the highest-scoring feature such that at least one new positive vector is covered
• The Fisher score $F_i$ of feature i is
  $F_i = \frac{|D|(\mu_{D,i} - \mu_i)^2 + |ND|(\mu_{ND,i} - \mu_i)^2}{|D|\sigma_{D,i}^2 + |ND|\sigma_{ND,i}^2}$
  where D and ND are the pseudo-duplicate and pseudo-non-duplicate sets, $\mu_{D,i}$, $\mu_{ND,i}$ and $\mu_i$ are the means of feature i within D, within ND, and overall, and $\sigma_{D,i}^2$, $\sigma_{ND,i}^2$ are the corresponding variances.
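The reframed Step 3 could look roughly like the sketch below. This is my own approximation under stated assumptions (binary vectors as Python lists, a single Fisher-score ranking pass, and only the disjunctive case without conjunction terms); it is not the paper's implementation.

```python
# Sketch of greedy feature selection driven by Fisher scores (Step 3, reframed).
# D and ND are lists of binary feature vectors (pseudo-duplicates / non-duplicates).

from statistics import mean, pvariance

def fisher_score(i, D, ND):
    """Fisher discrimination score of feature i over the two pseudo-sets."""
    d_col = [v[i] for v in D]
    nd_col = [v[i] for v in ND]
    mu = mean(d_col + nd_col)
    num = len(D) * (mean(d_col) - mu) ** 2 + len(ND) * (mean(nd_col) - mu) ** 2
    den = len(D) * pvariance(d_col) + len(ND) * pvariance(nd_col)
    return num / den if den > 0 else 0.0

def greedy_select(D, ND, max_uncovered=0):
    """Greedily pick high-scoring features until (almost) all positives are covered."""
    n_features = len(D[0])
    chosen, uncovered = [], set(range(len(D)))
    ranked = sorted(range(n_features), key=lambda i: fisher_score(i, D, ND), reverse=True)
    for i in ranked:
        if len(uncovered) <= max_uncovered:   # allowed number of uncovered duplicates
            break
        newly_covered = {k for k in uncovered if D[k][i] == 1}
        if newly_covered:                     # must cover at least one new positive vector
            chosen.append(i)
            uncovered -= newly_covered
    return chosen                             # indices of the selected SBPs
```

The returned indices correspond to SBPs whose disjunction forms a blocking scheme; the DNF case reported in the experiments additionally considers conjunctions of SBPs as candidate features.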
Experiments

Benchmarks
Dataset | Number of tuples | Task | True duplicate pairs
Restaurant | 864 | Deduplication | 112
Cora | 1295 | Deduplication | 17184
Census | 449+392 | Linkage | 327

Performance: pseudo-training set generation
Fig. 1: Precision of duplicates retrieved on Restaurant and Census
Fig. 2: Precision of duplicates retrieved on Cora
Precision of non-duplicates (up to 20,000): 100 percent on all datasets!

Robustness to parameter settings
Fig. 3: Precision of duplicates retrieved as ut is varied
Fig. 4: Precision of non-duplicates retrieved as lt is varied
No change as the window size c is varied from 20 to 50

Learning Schemes: Evaluation
• Traditional metrics (recall, precision) don't directly carry over
• Consider the full set of pairs Ω and the candidate set C produced by blocking
• Three metrics:
  – Reduction Ratio (RR): $1 - |C| / |\Omega|$
  – Pairs Completeness (PC): $|TP_C| / |TP_\Omega|$
  – Pairs Quality (PQ): $|TP_C| / |C|$

Performance: Learning disjunctive schemes
Our (unsupervised) approach vs. the supervised baseline: the supervised algorithm by Bilenko and Mooney (2006)

Performance: Learning DNF schemes
Our (unsupervised) approach vs. the supervised baseline: the supervised algorithm by Bilenko and Mooney (2006)

Current Work
• MapReduce implementation and integration into a complete deduplication system
• Better feature discrimination criteria
• Experiments on Big Data

Current Work: Preliminary Results
Prototype built and tested on some small datasets; it can be run iteratively to get better recall.
Dataset | Precision | Recall
Census* (1000 records) | 100.00 | 91.00
Census* (Noisy: 1000 records) | 100.00 | 61.17
Restaurant | 99.71 | 60.76

Unsupervised Deduplication: we enable it
Dirty Data → [Blocking (unsupervised, Kejriwal and Miranker) → Identify Duplicates (unsupervised)] → Clean Data

Thank you