Slides - Mayank Kejriwal

An Unsupervised Algorithm for
Learning Blocking Schemes
Mayank Kejriwal and Daniel P. Miranker
University of Texas at Austin
Unsupervised Deduplication: a 40-year-old problem, still open!
Dirty Data -> Deduplication (no user input!) -> Clean Data
Deduplication: what is it?
ID | Name                | Address                | City         | Cuisine
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
2  | Art’s Delicatessen  | 12224 Ventura Blvd.    | Studio City  | American
3  | Hotel Bel-Air       | 701 Stone Canyon Rd.   | Bel Air      | Californian
4  | Art’s Deli          | 12224 Ventura Blvd.    | Studio City  | Delis
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
3 ‘entities’ being represented by 5 records!
Other names and variants: Record Linkage, Entity
Resolution…
Suppose we have a Boolean similarity function Sim(r, s) that returns True iff Entity(r) = Entity(s).
Compare every record to every other? That is n(n-1)/2 calls to Sim: 10 for these 5 records.
What if…
Suppose we group first on the Name attribute
ID | Name                | Address                | City         | Cuisine
4  | Art’s Deli          | 12224 Ventura Blvd.    | Studio City  | Delis
2  | Art’s Delicatessen  | 12224 Ventura Blvd.    | Studio City  | American
---
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
---
3  | Hotel Bel-Air       | 701 Stone Canyon Rd.   | Bel Air      | Californian
Only compare records within a group! Each such group
constitutes a block
Blocking
• Key idea: not every pair of records needs
comparing
• Effective grouping requires good blocking
scheme (e.g. Name)
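The grouping idea can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the Name attribute is turned into a group, so the `blocking_key` function (first token of the name, lowercased) is an assumption, as are all function names.

```python
from collections import defaultdict
from itertools import combinations

# The five example records (ID -> Name) from the restaurant table.
records = {
    1: "Fenix",
    2: "Art's Delicatessen",
    3: "Hotel Bel-Air",
    4: "Art's Deli",
    5: "Fenix at the Argyle",
}


def blocking_key(name):
    # Hypothetical key: first whitespace-delimited token of the name, lowercased.
    return name.split()[0].lower()


def build_blocks(recs):
    """Group record IDs by their blocking key."""
    blocks = defaultdict(list)
    for rid, name in recs.items():
        blocks[blocking_key(name)].append(rid)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are paired for comparison."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


blocks = build_blocks(records)
pairs = candidate_pairs(blocks)
all_pairs = len(records) * (len(records) - 1) // 2

print(blocks)
print(sorted(pairs), "out of", all_pairs, "possible pairs")
```

With this key, only 2 of the 10 possible pairs are compared, and both are true duplicates.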
Unsupervised Deduplication: a 40-year-old problem, still open!
Dirty Data -> [Deduplication: Blocking -> Identify Duplicates] -> Clean Data
Unsupervised Deduplication: current state-of-the-art
Dirty Data -> [Deduplication: Blocking (supervised) -> Identify Duplicates (unsupervised)] -> Clean Data
Unsupervised Deduplication: our contribution
Dirty Data -> [Deduplication: Blocking (unsupervised; Kejriwal and Miranker) -> Identify Duplicates (unsupervised)] -> Clean Data
Two key questions
• How do we construct the space of blocking
schemes?
• How do we efficiently learn a good blocking
scheme from that space?
Constructing the space of blocking schemes
Given p General Indexing Functions (GIFs)
Example: Tokens(“Fenix at the Argyle”) = {“Fenix”, “at”, “the”, “Argyle”}
ID | Name                | Address                | City         | Cuisine
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
Constructing the space of blocking schemes
Given p GIFs -> Construct mp Specific Indexing Functions (SIFs), one per GIF and attribute (m attributes)
Example: Tokens.Name(r5) = {“Fenix”, “at”, “the”, “Argyle”}
Constructing the space of blocking schemes
Given p GIFs -> Construct mp SIFs -> Construct mp Specific Blocking Predicates (SBPs)
Example: CommonToken.Name(r1, r5) = True
Constructing the space of blocking schemes
Given p GIFs -> Construct mp SIFs -> Construct mp SBPs -> Construct space of $2^{2^{mp}}$ DNF Blocking Schemes
DNF blocking schemes
Space of $2^{2^{mp}}$ DNF Blocking Schemes
Example 1: (ExactMatch.City AND CommonInteger.Address) OR ExactMatch.Cuisine
Evaluated on records 1 and 5: False!
DNF blocking schemes
Space of $2^{2^{mp}}$ DNF Blocking Schemes
Example 2: CommonToken.Name AND CommonInteger.Address
Evaluated on records 1 and 5: True
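The chain from general indexing functions to DNF schemes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function bodies for `tokens`, `integers`, and `exact`, and the helpers `sbp` and `dnf`, are assumptions chosen to reproduce the two examples on records 1 and 5.

```python
import re

r1 = {"Name": "Fenix", "Address": "8358 Sunset Blvd. West",
      "City": "Hollywood", "Cuisine": "American"}
r5 = {"Name": "Fenix at the Argyle", "Address": "8358 Sunset Blvd.",
      "City": "W. Hollywood", "Cuisine": "French (new)"}


# General indexing functions (GIFs): map a string to a set of keys.
def tokens(value):
    return set(value.split())


def integers(value):
    return set(re.findall(r"\d+", value))


def exact(value):
    return {value}


def sbp(gif, attribute):
    """Specific blocking predicate: True iff two records share at least one
    key under the specific indexing function (gif applied to attribute)."""
    def predicate(r, s):
        return bool(gif(r[attribute]) & gif(s[attribute]))
    return predicate


common_token_name = sbp(tokens, "Name")
common_integer_address = sbp(integers, "Address")
exact_match_city = sbp(exact, "City")
exact_match_cuisine = sbp(exact, "Cuisine")


def dnf(terms):
    """DNF scheme: a disjunction of conjunctions of predicates."""
    def scheme(r, s):
        return any(all(p(r, s) for p in term) for term in terms)
    return scheme


example1 = dnf([[exact_match_city, common_integer_address],
                [exact_match_cuisine]])
example2 = dnf([[common_token_name, common_integer_address]])

print(example1(r1, r5))  # False
print(example2(r1, r5))  # True
```

Example 1 fails because the City and Cuisine values differ as exact strings, even though both addresses contain 8358; Example 2 succeeds because the names share the token "Fenix" and the addresses share 8358.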
Learning a scheme
Space of $2^{2^{mp}}$ DNF Blocking Schemes -> Efficiently learn one scheme
Unsupervised learning algorithm
Tabular dataset -> Generate pseudo training set -> Perform feature selection -> DNF Blocking Scheme
Step 1: Extract Term and Record Frequency Statistics
ID | Name                | Address                | City         | Cuisine
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
Term frequency for record r1:
Term   | Frequency
Fenix  | 1
8358   | 1
Sunset | 1
and so on…
Term frequency for record r5:
Term  | Frequency
Fenix | 1
at    | 1
the   | 1
and so on…
Record frequency (number of records containing the term):
Term  | Frequency
Fenix | 2
8358  | 2
at    | 1
and so on…
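The two kinds of statistics can be computed with standard counters. The sketch below assumes a simple tokenizer (split on whitespace, drop parentheses); the slides do not specify the tokenization, and the names `tokenize`, `term_freq`, and `record_freq` are illustrative.

```python
import re
from collections import Counter

records = {
    "r1": "Fenix 8358 Sunset Blvd. West Hollywood American",
    "r5": "Fenix at the Argyle 8358 Sunset Blvd. W. Hollywood French (new)",
}


def tokenize(text):
    # Assumption: tokens are runs of characters other than whitespace
    # and parentheses, so "(new)" becomes "new".
    return re.findall(r"[^\s()]+", text)


# Term frequency: how often each token occurs within one record.
term_freq = {rid: Counter(tokenize(text)) for rid, text in records.items()}

# Record frequency: in how many records each token occurs at least once.
record_freq = Counter()
for counts in term_freq.values():
    record_freq.update(counts.keys())

print(term_freq["r1"]["Fenix"], term_freq["r5"]["at"])
print(record_freq["Fenix"], record_freq["8358"], record_freq["at"])
```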
Step 2: Block each record on its tokens
ID | Name                | Address                | City         | Cuisine
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
Treat each record like a bag of tokens
Each token represents a block the record is placed in
Record r1 has tokens: Fenix, 8358, Sunset, Blvd., West, Hollywood, American
Record r5 has tokens: Fenix, at, the, Argyle, 8358, Sunset, Blvd., W., Hollywood, French, new
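Token blocking amounts to building an inverted index from tokens to records. The following sketch uses the same assumed tokenizer as a plausible reading of the token lists (whitespace split, parentheses dropped); `token_blocks` is a hypothetical helper name.

```python
import re
from collections import defaultdict

records = {
    "r1": "Fenix 8358 Sunset Blvd. West Hollywood American",
    "r5": "Fenix at the Argyle 8358 Sunset Blvd. W. Hollywood French (new)",
}


def tokenize(text):
    # Assumption: split on whitespace and drop parentheses.
    return re.findall(r"[^\s()]+", text)


def token_blocks(recs):
    """Inverted index: each token is a block holding the records that contain it."""
    blocks = defaultdict(set)
    for rid, text in recs.items():
        for tok in tokenize(text):
            blocks[tok].add(rid)
    return dict(blocks)


blocks = token_blocks(records)
# Blocks with more than one record are the only ones that produce pairs.
shared = {tok: ids for tok, ids in blocks.items() if len(ids) > 1}
print(sorted(shared))
```

Records r1 and r5 end up together in five blocks, one for each token they have in common.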
Step 3: Slide a window of size c over each block
Consider block B with |B| = 8 and c = 3
Block: [1 2 3] 4 5 6 7 8 (brackets mark the window)
Generate symmetric pairs: (r1,r2), (r2,r3), (r1,r3)
Compute log TF-IDF score: score(r1,r2), score(r2,r3), score(r1,r3)
Apply thresholds ut (upper) and lt (lower) and determine if (pseudo) dup/non-dup:
• score(r1,r2) > ut? No. score(r1,r2) < lt? Yes. Assign to non-dup set
• score(r2,r3) > ut? No. score(r2,r3) < lt? No. Do nothing
• score(r1,r3) > ut? Yes. Assign to dup set
Slide the window by one record
Block: 1 [2 3 4] 5 6 7 8
Generate new pairs: (r2,r4), (r3,r4)
Compute scores, apply thresholds…
…repeat all the way till the end
Why windowing?
• Without windowing, generating the pseudo-set could take $O(n^2)$ in the worst case
• Windowing guarantees $O(n)$ (details in paper)
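The windowing step can be sketched as below. It is an illustration under assumptions: the scoring function is passed in as a parameter because the slides only name it (log TF-IDF) without defining it, the toy scores are made up, and the function names are hypothetical. For a fixed window size c, a block of n records yields at most n(c-1) pairs, which is where the linear bound comes from.

```python
def window_pairs(block, c):
    """Yield every unordered pair that falls inside some window of c
    consecutive records, without emitting any pair twice."""
    for i, r in enumerate(block):
        for s in block[i + 1 : i + c]:
            yield (r, s)


def generate_pseudo_set(block, c, score, ut, lt):
    """Label windowed pairs: score above ut -> pseudo-duplicate,
    score below lt -> pseudo non-duplicate, otherwise ignored."""
    dups, non_dups = [], []
    for r, s in window_pairs(block, c):
        value = score(r, s)
        if value > ut:
            dups.append((r, s))
        elif value < lt:
            non_dups.append((r, s))
    return dups, non_dups


block = list(range(1, 9))  # |B| = 8
pairs = list(window_pairs(block, 3))
print(pairs[:5])   # [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
print(len(pairs))  # 13, at most |B| * (c - 1)

# Made-up scores that mirror the walk-through for the first window.
toy_scores = {(1, 2): 0.05, (1, 3): 0.9, (2, 3): 0.5}
dups, non_dups = generate_pseudo_set(
    [1, 2, 3], 3, lambda r, s: toy_scores[(r, s)], ut=0.8, lt=0.1)
print(dups, non_dups)  # [(1, 3)] [(1, 2)]
```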
Unsupervised learning algorithm
Tabular dataset -> Generate pseudo training set -> Perform feature selection -> DNF Blocking Scheme
Step 1: Extract binary features for each pair in pseudo-set
ID | Name                | Address                | City         | Cuisine
1  | Fenix               | 8358 Sunset Blvd. West | Hollywood    | American
5  | Fenix at the Argyle | 8358 Sunset Blvd.      | W. Hollywood | French (new)
ExactMatch.Name? False -> 0
CommonToken.Name? True -> 1
…
CommonToken.Cuisine? False -> 0
Binary feature vector [0, 1, … , 0] obtained for pair
Vector size equals the number of specific blocking predicates (= mp)
Step 2: Collect feature vectors for pseudo-set
Pseudo Duplicates:
[0, 1, …, 0]
[1, 0, …, 0]
[1, 1, …, 1]
…
[0, 0, …, 1]
Pseudo Non-Duplicates:
[1, 1, …, 0]
[0, 0, …, 1]
[1, 0, …, 1]
…
[0, 0, …, 0]
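Feature extraction can be sketched as evaluating every specific blocking predicate on a pair. The example below is an assumption-laden illustration: it uses only two general predicates (ExactMatch and CommonToken) over four attributes, so mp = 8 here, and the ordering of the features is arbitrary rather than the ordering on the slide.

```python
def exact_match(attr):
    return lambda r, s: r[attr] == s[attr]


def common_token(attr):
    return lambda r, s: bool(set(r[attr].split()) & set(s[attr].split()))


ATTRIBUTES = ["Name", "Address", "City", "Cuisine"]           # m = 4
GENERAL = [("ExactMatch", exact_match), ("CommonToken", common_token)]  # p = 2

# One specific blocking predicate per (general predicate, attribute) pair.
PREDICATES = [(f"{g}.{a}", make(a)) for g, make in GENERAL for a in ATTRIBUTES]


def feature_vector(r, s):
    """Binary vector with one entry per specific blocking predicate."""
    return [int(pred(r, s)) for _, pred in PREDICATES]


r1 = {"Name": "Fenix", "Address": "8358 Sunset Blvd. West",
      "City": "Hollywood", "Cuisine": "American"}
r5 = {"Name": "Fenix at the Argyle", "Address": "8358 Sunset Blvd.",
      "City": "W. Hollywood", "Cuisine": "French (new)"}

print([name for name, _ in PREDICATES])
print(feature_vector(r1, r5))  # [0, 0, 0, 0, 1, 1, 1, 0]
```

Running `feature_vector` over every pair in the pseudo-duplicate and pseudo non-duplicate sets gives the two collections of vectors used for feature selection.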
Step 3 (ideal): Choose minimum feature subset such that…
• Chosen features leave at most $\epsilon$ duplicates uncovered
• Minimum number of non-duplicates covered
• Finding the optimal solution is NP-hard
Step 3 (reframed): Approximate solution by…
• Using greedy algorithm + Fisher discrimination
• Greedy algorithm: iteratively pick the highest-scoring feature such that at least one new positive vector is covered
• Fisher score $\rho_i$ of feature i is
$$\rho_i = \frac{|D|\,(\mu_{D,i} - \mu_i)^2 + |ND|\,(\mu_{ND,i} - \mu_i)^2}{|D|\,\sigma_{D,i}^2 + |ND|\,\sigma_{ND,i}^2}$$
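A minimal sketch of the greedy step, assuming D and ND are the pseudo-duplicate and pseudo non-duplicate feature vectors, that the means and variances are taken per feature over each set (population variance), and that the selected predicates are combined as a disjunction. The stopping rule (stop once every pseudo-duplicate is covered) and the toy data are assumptions, not taken from the slides.

```python
from statistics import mean, pvariance


def fisher_scores(D, ND):
    """Fisher score of each feature: between-class scatter of the feature
    means divided by the size-weighted within-class variances."""
    scores = []
    for i in range(len(D[0])):
        d = [float(v[i]) for v in D]
        nd = [float(v[i]) for v in ND]
        mu = mean(d + nd)
        num = len(d) * (mean(d) - mu) ** 2 + len(nd) * (mean(nd) - mu) ** 2
        den = len(d) * pvariance(d) + len(nd) * pvariance(nd)
        if den > 0:
            scores.append(num / den)
        else:
            # Zero variance in both sets: perfectly separating or useless.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores


def greedy_select(D, scores):
    """Visit features in decreasing score order; keep a feature only if it
    covers at least one pseudo-duplicate not covered so far."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    covered, chosen = set(), []
    for i in order:
        newly = {k for k, v in enumerate(D) if v[i] == 1} - covered
        if newly:
            chosen.append(i)
            covered |= newly
        if len(covered) == len(D):
            break
    return chosen


# Toy vectors over three features (made up for illustration).
D = [[1, 1, 0], [1, 0, 0], [0, 1, 1]]
ND = [[0, 1, 0], [0, 1, 1], [0, 0, 0]]

scores = fisher_scores(D, ND)
chosen = greedy_select(D, scores)
print(scores)
print(chosen)
```

Feature 0 fires only on pseudo-duplicates, so it gets the highest score and is picked first; a second feature is added only because one pseudo-duplicate is still uncovered.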
Experiments
Benchmarks
Dataset    | Task          | Number of tuples | True duplicate pairs
Restaurant | Deduplication | 864              | 112
Cora       | Deduplication | 1295             | 17184
Census     | Linkage       | 449+392          | 327
Performance: pseudo-training set
generation
Fig. 1: Precision of duplicates retrieved
on Restaurant and Census
Fig. 2: Precision of duplicates
retrieved on Cora
Precision of non-duplicates (up to 20,000): 100 percent on all datasets!
Robustness to parameter settings
Fig. 3: Precision of duplicates
retrieved as ut is varied
Fig. 4: Precision of non-duplicates
retrieved as lt is varied
No change as window size c is varied from 20 to 50
Learning Schemes: Evaluation
• Traditional metrics (recall, precision) don’t directly carry over
• Consider full set of pairs $\Omega$ and candidate set $\Gamma$
• Three metrics:
– Reduction Ratio (RR): $1 - |\Gamma| / |\Omega|$
– Pairs Completeness (PC): $|\Gamma_{TP}| / |\Omega_{TP}|$
– Pairs Quality (PQ): $|\Gamma_{TP}| / |\Gamma|$
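The three metrics can be computed directly from sets of pairs. The sketch below assumes pairs are represented as sorted ID tuples; `blocking_metrics` is a hypothetical helper name, and the example reuses the five restaurant records, whose true duplicate pairs are (1, 5) and (2, 4).

```python
from itertools import combinations


def blocking_metrics(all_pairs, candidate_pairs, true_pairs):
    """Return (RR, PC, PQ) for a candidate set produced by blocking."""
    true_candidates = candidate_pairs & true_pairs
    rr = 1 - len(candidate_pairs) / len(all_pairs)   # reduction ratio
    pc = len(true_candidates) / len(true_pairs)      # pairs completeness
    pq = len(true_candidates) / len(candidate_pairs)  # pairs quality
    return rr, pc, pq


all_pairs = set(combinations(range(1, 6), 2))  # 10 pairs over 5 records
true_pairs = {(1, 5), (2, 4)}

# A scheme that yields exactly the two true pairs.
print(blocking_metrics(all_pairs, {(1, 5), (2, 4)}, true_pairs))

# A looser scheme that also pairs records 1 and 2 but misses (2, 4).
print(blocking_metrics(all_pairs, {(1, 5), (1, 2)}, true_pairs))
```

RR rewards small candidate sets, PC plays the role of recall, and PQ plays the role of precision, which is why a scheme has to be judged on all three together.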
Performance: Learning disjunctive schemes
Our (unsupervised) approach
Supervised
Baseline: Supervised algorithm by Bilenko and Mooney (2006)
Performance: Learning DNF Schemes
Our (unsupervised) approach
Supervised
Baseline: Supervised algorithm by Bilenko and Mooney (2006)
Current Work
• MapReduce implementation and integration
into a complete deduplication system
• Better feature discrimination criteria
• Experiments on Big Data
Current Work: Preliminary Results
Prototype built and tested on some small datasets
Dataset                       | Precision | Recall
Census* (1000 records)        | 100.00    | 91.00
Census* (Noisy: 1000 records) | 100.00    | 61.17
Restaurant                    | 99.71     | 60.76
Can be run iteratively to get better recall
Unsupervised Deduplication: we enable it
Dirty Data -> [Unsupervised Deduplication: Blocking (unsupervised; Kejriwal and Miranker) -> Identify Duplicates (unsupervised)] -> Clean Data
Thank you