Large-Scale Deduplication with Constraints using Dedupalog
Arvind Arasu et al.
Definitions

• Deduplication: the process of identifying references in data records that refer to the same real-world entity.
• Collective Deduplication: a generalization of deduplication in which one deduplicates several related types of real-world entities in a set of records at once.
Weaknesses of Prior Approaches

• Allow clustering of only a single entity type in isolation, so they cannot answer queries such as: how many distinct papers were in ICDE 2008?
• Ignore constraints, or
• Use constraints in an ad-hoc way that prevents users from flexibly combining them.
Constraints

• "ICDE" and "Conference on Data Engineering" are the same conference.
• Conferences in different cities are in different years.
• Author references that do not share any common coauthors do not refer to the same author.
Dedupalog

• Collective deduplication
• Declarative
• Domain independent
• Expressive enough to encode many constraints (hard and soft)
• Scales to large datasets
Dedupalog

1. Provide a set of input tables that contain the references to be deduplicated and other useful information (e.g., the results of similarity computation).
2. Define a list of entity references to deduplicate (e.g., authors, papers, publishers).
3. Define a Dedupalog program.
4. Execute the Dedupalog program.
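To make the workflow concrete, here is a minimal Python sketch of what the inputs in steps 1-3 might look like. The table names and tuples (paper_ref, title_similar) are hypothetical stand-ins, not the paper's actual example data.

    # Step 1 (hypothetical): input tables as plain Python collections,
    # including the output of a precomputed similarity join.
    paper_ref = {                      # reference id -> title
        "p1": "Dedupalog at Scale",
        "p2": "Dedupalog at scale",
        "p3": "A Survey of Deduplication",
    }
    title_similar = {("p1", "p2")}     # pairs with similar titles

    # Step 2 (hypothetical): the entity references to deduplicate,
    # in the spirit of declaring Paper!.
    paper_refs = set(paper_ref)

    # Step 3: a Dedupalog program is a list of rules over these tables;
    # the sketches after each rule type below illustrate their semantics.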
Notation

• Entity reference relation: declared with a trailing "!" (e.g., Paper!, Author!, Publisher!); these hold the references to be deduplicated.
• Clustering relation (these are duplicates): written with a trailing "*" (e.g., Paper*); a tuple (x, y) asserts that references x and y refer to the same entity.
Input Tables (example)

Entity Reference Tables (example)

Dedupalog Program (example)

*Conflicts that arise from the rules are detected by the system and reported to the user.
Soft-complete Rules

• "Papers with similar titles are likely duplicates."

Paper references whose titles appear in TitleSimilar are likely to be clustered together; paper references whose titles do not appear in TitleSimilar are likely not to be clustered together.
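A minimal sketch of the soft-complete semantics in Python, reusing the hypothetical tables above: pairs in TitleSimilar receive a soft-plus [+] vote, and every other pair receives a soft-minus [-] vote.

    from itertools import combinations

    paper_refs = {"p1", "p2", "p3"}
    title_similar = {("p1", "p2")}

    # Soft-complete: the rule speaks about *every* pair of references.
    votes = {}
    for u, v in combinations(sorted(paper_refs), 2):
        if (u, v) in title_similar or (v, u) in title_similar:
            votes[(u, v)] = "+"   # likely duplicates
        else:
            votes[(u, v)] = "-"   # likely NOT duplicates
    # votes == {("p1","p2"): "+", ("p1","p3"): "-", ("p2","p3"): "-"}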
Soft-incomplete Rules

• "Papers with very similar titles are likely duplicates."

Paper references whose titles appear in TitleVerySimilar are likely to be clustered together.
*This rule says nothing about paper references whose titles do not appear in TitleVerySimilar.
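By contrast, a soft-incomplete rule only ever adds [+] votes; a sketch with the same hypothetical tables:

    paper_refs = {"p1", "p2", "p3"}
    title_very_similar = {("p1", "p2")}

    # Soft-incomplete: silent about pairs not in TitleVerySimilar.
    votes = {pair: "+" for pair in title_very_similar}
    # No "-" votes are generated for the remaining pairs.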
Hard Rules

• Must-link: "The publisher references listed in the table PublisherEQ must be clustered together."
• Cannot-link: "The publisher references in PublisherNEQ must not be clustered together."

Hard rules may only contain a positive body.
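A sketch of the hard-rule semantics with hypothetical PublisherEQ / PublisherNEQ tables: [=] edges are must-links and [≠] edges (written "!=" below) are cannot-links that the final clustering must respect exactly.

    publisher_eq = {("pub1", "pub2")}    # must be clustered together
    publisher_neq = {("pub1", "pub3")}   # must NOT be clustered together

    edges = {}
    for pair in publisher_eq:
        edges[pair] = "="    # hard-plus (must-link)
    for pair in publisher_neq:
        edges[pair] = "!="   # hard-minus (cannot-link)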
Complex Hard Rules

• "Whenever we cluster two papers, we must also cluster the publishers of those papers."

Such constraints are central to collective deduplication. At most one entity reference is allowed in the body of the rule, as in this example.
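A sketch of this complex hard rule: a must-link between two papers is propagated to their publishers. The publisher_of mapping is hypothetical, and this illustrates the rule's semantics, not the paper's evaluation strategy.

    # Hypothetical mapping and an already-derived paper must-link edge.
    publisher_of = {"p1": "pub1", "p2": "pub2"}
    paper_must_link = {("p1", "p2")}

    publisher_must_link = set()
    for a, b in paper_must_link:
        pa, pb = publisher_of[a], publisher_of[b]
        if pa != pb:
            # Clustering the two papers forces clustering their publishers.
            publisher_must_link.add((pa, pb))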
Complex Negative Rules

• "Two distinct author references on a single paper cannot be the same person."
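A sketch of the complex negative rule: author references listed on the same paper get pairwise cannot-link edges. The wrote table is a hypothetical stand-in.

    from itertools import combinations

    wrote = {"p1": ["a1", "a2", "a3"]}   # paper -> its author references

    cannot_link = set()
    for authors in wrote.values():
        for u, v in combinations(authors, 2):
            cannot_link.add((u, v))      # distinct coauthors of one paper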
Recursive Rules

• "Authors that do not share common coauthors are unlikely to be duplicates."

These constraints require inspecting the current clustering, hence the recursion. Recursion is only allowed in soft rules.
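A sketch of how such a recursive soft rule might be checked against the *current* author clustering: two author references earn a [-] vote when, under the clustering so far, none of their coauthors coincide. The data and helper names are hypothetical.

    coauthors = {"a1": {"b1"}, "a2": {"b2"}}   # author -> coauthor refs
    cluster_of = {"b1": 1, "b2": 2}            # current (partial) clustering

    def share_coauthor(x, y):
        cx = {cluster_of[c] for c in coauthors[x]}
        cy = {cluster_of[c] for c in coauthors[y]}
        return bool(cx & cy)

    votes = {}
    if not share_coauthor("a1", "a2"):
        votes[("a1", "a2")] = "-"   # unlikely duplicates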
Cost

If gamma is soft-complete, then its cost on a clustering J* is the number of tuples on which the constraint gamma and the clustering J* disagree.
Cost

The cost is 2 because of two violations:
1) d belongs in the same cluster as c
2) c does not belong in the same cluster as b

*The goal of Dedupalog is to find a clustering that incurs the minimum cost.
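Concretely, for a soft-complete rule the cost is the size of the symmetric difference between the pairs the rule predicts and the pairs the clustering actually puts together. A minimal sketch with hypothetical data (not the slide's figure):

    from itertools import combinations

    def norm(pair):
        return tuple(sorted(pair))

    def intra_cluster_pairs(clusters):
        pairs = set()
        for members in clusters:
            pairs |= {norm(p) for p in combinations(members, 2)}
        return pairs

    predicted = {norm(("b", "c")), norm(("c", "d"))}  # pairs the rule predicts
    clusters = [["a", "b"], ["c", "d"]]               # candidate clustering J*

    actual = intra_cluster_pairs(clusters)
    cost = len(predicted ^ actual)
    # cost == 2: one predicted pair is split, one clustered pair was not predicted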
Main Algorithm

Finding the minimum-cost clustering turns out to be NP-hard, even for a single soft-complete constraint. For a large fragment of Dedupalog, the following algorithm is a constant-factor approximation of the optimal.
Clustering Graphs

• A clustering graph is a pair (V, Phi).
• V is a set of nodes, each corresponding to an entity reference.
• Phi is a symmetric function that assigns each pair of nodes (u, v), i.e., each edge, one of four labels:
  [+] : soft-plus
  [-] : soft-minus
  [=] : hard-plus
  [≠] : hard-minus
Clustering Graphs

• Uniformly choose a random permutation of the nodes.
• This induces a partial order on the edges.
• Harden each edge in order:
  - Change soft edges into hard edges.
  - Apply the two hardening rules (shown on the slide).
• A clustering is the set of [=]-connected components.
• Guarantees: the result is a constant-factor approximation of the minimum-cost clustering in expectation.
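The two hardening rules themselves are in the slide figure, but the overall shape of the algorithm is a pivot-style correlation-clustering pass. A simplified Python sketch follows; it treats [=] like [+] and [≠] like [-] instead of applying the paper's hardening rules, so it is an illustration of the random-permutation idea, not the full algorithm.

    import random

    def cluster(nodes, phi):
        """phi(u, v) -> one of '+', '-', '=', '!='."""
        order = list(nodes)
        random.shuffle(order)            # uniform random permutation
        cluster_of = {}
        for pivot in order:
            if pivot in cluster_of:
                continue
            cluster_of[pivot] = pivot    # pivot starts a new cluster
            for v in order:
                if v not in cluster_of and phi(pivot, v) in ("+", "="):
                    cluster_of[v] = pivot
        return cluster_of                # node -> cluster representative

    # Usage with the tiny vote table from the soft-complete sketch:
    votes = {("p1", "p2"): "+", ("p1", "p3"): "-", ("p2", "p3"): "-"}
    phi = lambda u, v: votes.get((u, v), votes.get((v, u), "-"))
    print(cluster({"p1", "p2", "p3"}, phi))  # p1, p2 together; p3 alone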
Creating the Clustering Graph

• Perform forward voting.
• Perform backward propagation.
• This creates a clustering graph for each entity reference relation (i.e., Publisher!, Paper!, and Author!).
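A heavily simplified sketch of what forward voting might look like: each soft rule casts [+]/[-] votes on edges, and the tally determines each edge's soft label. The majority-vote tallying below is my assumption; the paper's actual voting and backward-propagation details differ.

    from collections import Counter

    # Hypothetical votes cast by two soft rules on the same edge set.
    rule_votes = [
        {("p1", "p2"): "+", ("p2", "p3"): "-"},   # e.g., a title-based rule
        {("p1", "p2"): "+", ("p2", "p3"): "+"},   # e.g., a venue-based rule
    ]

    tally = {}
    for votes in rule_votes:
        for edge, label in votes.items():
            tally.setdefault(edge, Counter())[label] += 1

    labels = {e: c.most_common(1)[0][0] for e, c in tally.items()}
    # ("p1","p2") -> "+"; ("p2","p3") is tied, broken arbitrarily here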
Physical Implementation and Optimization

• Implicit representation of edges: implicitly store some edge values.
• Choosing edge orderings: order edges so that [+] edges are processed first.
• Sort optimization.
Experiments: Cora dataset

• Standard: matching titles and running correlation clustering.
• NEQ: Standard plus an additional hard-rule constraint that lists conference papers known to be distinct from their journal counterparts.
Experiments: Cora dataset

• Standard: clustering based on string similarity.
• Soft Constraint: Standard plus the soft constraint "papers must be in a single conference".
• Hard Constraint: Standard plus the hard constraint "papers must be in a single conference".
Experiments: ACM dataset

• The ACM dataset spans 1988-1990 and contains 436,000 references.
• No Constraints: string similarity and correlation clustering.
• Constraints: No Constraints plus the hard constraint "references with different years do not refer to the same conference".
Experiments: ACM dataset

• Constraints help catch errors in records.
• Added a hard rule: "If two references refer to the same paper, then they must refer to the same conference."
• On the subset, this found 5 references that contained incorrect years.
• On the full dataset, it found 152 suspect papers.
Performance

• Vanilla: clustering of the references by conference with a single soft constraint.
• [=]: Vanilla with two additional hard constraints.
• HMorphism: Vanilla + [=], clustering conferences and papers with the constraint "conference papers appear in only one conference".
• NoStream: Vanilla with the sort optimization turned off.
Interactive Deduplication

• Manual clustering of Cora took a couple of hours and achieved 98% precision and recall.
• Obtaining ground truth for the ACM subset took only 4 hours.
Conclusions

• Proposed a novel language, Dedupalog.
• Validated its practicality on two datasets.
• Proved that a large syntactic fragment of Dedupalog admits a constant-factor approximation, via a novel algorithm.
Strengths

• Conveys the language and algorithm effectively.
• Mostly good examples that help readers understand the contribution.
• Strong and meaningful results.
• Good contribution.
Weaknesses

• Some mislabeled figure references and occasional typos can cause grief.