Large-Scale Deduplication with Constraints using Dedupalog
Arvind Arasu et al.

Definitions
•Deduplication: the process of identifying references in data records that refer to the same real-world entity.
•Collective deduplication: a generalization of deduplication in which one deduplicates several related types of real-world entities in a set of records at the same time.

Weaknesses of Prior Approaches
•They allow clustering of only a single entity type in isolation, so they cannot answer queries such as: how many distinct papers were in ICDE 2008?
•They ignore constraints, or
•They use constraints in an ad hoc way that prevents users from flexibly combining them.

Constraints
Examples of constraints a deduplication system should be able to exploit:
•"ICDE" and "Conference on Data Engineering" are the same conference.
•Conferences held in different cities or in different years are different conferences.
•Author references that do not share any common coauthors do not refer to the same author.

Dedupalog
•Supports collective deduplication.
•Declarative.
•Domain independent.
•Expressive enough to encode many constraints (hard and soft).
•Scales to large datasets.

Using Dedupalog
1. Provide a set of input tables that contain the references to be deduplicated and other useful information (e.g., the results of similarity computations).
2. Define a list of entity references to deduplicate (e.g., authors, papers, publishers).
3. Define a Dedupalog program.
4. Execute the Dedupalog program.

Notation
•Entity reference relation (example figure in the original slides).
•Clustering relation, i.e., "these are duplicates" (example figure in the original slides).

Examples (figures in the original slides)
•Input tables.
•Entity reference tables.
•Dedupalog program.
*Conflicts that arise from the rules are detected by the system and reported to the user.

Soft-complete Rules
"Papers with similar titles are likely duplicates."
•Paper references whose titles appear in TitleSimilar are likely to be clustered together.
•Paper references whose titles do not appear in TitleSimilar are not likely to be clustered together.

Soft-incomplete Rules
"Papers with very similar titles are likely duplicates."
•Paper references whose titles appear in TitleVerySimilar are likely to be clustered together.
•This rule says nothing about paper references whose titles do not appear in TitleVerySimilar.

Hard Rules
•"The publisher references listed in the table PublisherEQ must be clustered together" (must-link).
•"The publisher references listed in PublisherNEQ must not be clustered together" (cannot-link).
•Hard rules may only contain a positive body.

Complex Hard Rules
"Whenever we cluster two papers, we must also cluster the publishers of those papers."
•Such constraints are central to collective deduplication.
•At most one entity reference relation is allowed in the body of such a rule, as in this example.

Complex Negative Rules
"Two distinct author references on a single paper cannot be the same person."

Recursive Rules
"Authors that do not share common coauthors are unlikely to be duplicates."
•These constraints require inspecting the current clustering, hence the recursion.
•Recursion is only allowed in soft rules.

Cost
•If a constraint γ is soft-complete, then its cost on a clustering J* is the number of tuples on which γ and J* disagree: pairs that γ says should be clustered together but are separated, plus pairs that γ says should not be clustered together but are merged.

Cost (example)
•In the slides' example figure, the cost of the clustering is 2 because of two violations:
  1) d belongs in the same cluster as c, but the clustering separates them;
  2) c does not belong in the same cluster as b, but the clustering merges them.
•The goal of Dedupalog is to find a clustering that incurs the minimum total cost.
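To make this cost concrete, here is a minimal Python sketch (not taken from the paper; names such as soft_complete_cost and should_link are illustrative). It assumes the body of one soft-complete rule has already been evaluated into a set of should-link pairs over the relevant entity references, and it counts the disagreements between that rule and a clustering. The toy values at the bottom are hypothetical stand-ins for the slides' b/c/d figure and reproduce its cost of 2.

```python
from itertools import combinations

def soft_complete_cost(clusters, should_link):
    """Disagreement cost of a clustering against one soft-complete rule.

    clusters    : list of sets of entity references (a partition J*)
    should_link : set of frozenset({x, y}) pairs derived by the rule body
                  (e.g., paper pairs whose titles appear in TitleSimilar)
    """
    # Pairs the clustering places in the same cluster.
    co_clustered = {
        frozenset(pair)
        for cluster in clusters
        for pair in combinations(cluster, 2)
    }
    # Soft-complete semantics "vote" both ways, so we pay for
    #   (1) should-link pairs the clustering separates, and
    #   (2) co-clustered pairs the rule says should not be linked.
    missed_links = should_link - co_clustered
    extra_links = co_clustered - should_link
    return len(missed_links) + len(extra_links)

# Toy example in the spirit of the slides' b/c/d figure (values hypothetical):
clusters = [{"b", "c"}, {"d"}]
should_link = {frozenset({"c", "d"})}   # c and d are "likely duplicates"
print(soft_complete_cost(clusters, should_link))  # -> 2
```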
Main Algorithm
•Finding a minimum-cost clustering turns out to be NP-hard, even for a single soft-complete constraint.
•For a large syntactic fragment of Dedupalog, the following algorithm is a constant-factor approximation of the optimal.

Clustering Graphs
•A clustering graph is a pair (V, Φ):
  •V is a set of nodes, one per entity reference.
  •Φ is a symmetric function that assigns each edge, i.e., each pair of nodes (u, v), one of the labels:
    [+] : soft-plus
    [-] : soft-minus
    [=] : hard-plus
    [≠] : hard-minus

Clustering Graphs (algorithm)
•Uniformly choose a random permutation of the nodes; this gives a partial order on the edges.
•Harden each edge in that order, changing soft edges into hard edges by applying the paper's two hardening rules (shown on the original slide).
•The output clustering is the set of connected components of [=] edges.
•Guarantee: on the supported fragment of Dedupalog, this yields a constant-factor approximation of the optimal cost.
•A minimal code sketch of this hardening pass appears at the end of these notes.

Creating the Clustering Graph
•Perform forward voting.
•Perform backward propagation.
•This creates a clustering graph for each entity reference relation (i.e., Publisher!, Paper!, and Author!).

Physical Implementation and Optimization
•Implicit representation of edges: implicitly store some edge values instead of materializing them all.
•Choosing edge orderings: order the edges so that [+] edges are processed first.
•Sort optimization.

Experiments: Cora dataset
•Standard: matching titles and running correlation clustering.
•NEQ: Standard plus an additional hard-rule constraint, a list of conference papers known to be distinct from their journal versions.

Experiments: Cora dataset (continued)
•Standard: clustering based on string similarity.
•Soft constraint: Standard plus the soft constraint "papers must be in a single conference".
•Hard constraint: Standard plus the hard constraint "papers must be in a single conference".

Experiments: ACM dataset
•ACM data spanning 1988-1990; the full ACM dataset has 436,000 references.
•No Constraints: string similarity and correlation clustering.
•Constraints: No Constraints plus the hard constraint "references with different years do not refer to the same conference".

Experiments: ACM dataset (continued)
•Constraints also help catch errors in the records.
•Added a hard rule: "if two references refer to the same paper, then they must refer to the same conference."
•On the subset this found 5 references containing incorrect years; on the full dataset it flagged 152 suspect papers.

Performance
•Vanilla: clustering of the references by conference with a single soft constraint.
•[=]: Vanilla with two additional hard constraints.
•HMorphism: Vanilla + [=], clustering conferences and papers together with the constraint "conference papers appear in only one conference".
•NoStream: Vanilla with the sort optimization turned off.

Interactive Deduplication
•Manual clustering of Cora took a couple of hours and achieved 98% precision and recall.
•Obtaining ground truth for the ACM subset took only 4 hours.

Conclusions
•Proposed a novel language, Dedupalog.
•Validated its practicality on two datasets.
•Proved that a large syntactic fragment of Dedupalog admits a constant-factor approximation, using a novel algorithm.

Strengths
•The authors convey their language and algorithm effectively.
•Mostly good examples that help readers understand the contribution.
•Strong and meaningful results.
•A good contribution.

Weaknesses
•Some mislabeled figure references and occasional typos can cause grief.
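To complement the "Clustering Graphs (algorithm)" notes above, here is a minimal Python sketch of the hardening pass. It is a simplification under stated assumptions, not the paper's exact procedure: the two hardening rules are not reproduced in these notes, so the sketch substitutes a basic consistency check (an edge is forced to [=] when its endpoints are already connected by [=] edges, and forced to hard-minus when their components are already kept apart by a hard-minus edge); the edge order induced by the node permutation is one natural choice; edges absent from the label map are simply never hardened; and all names (harden_and_cluster, labels, the toy p1/p2/p3 data) are illustrative.

```python
import random

# Clustering-graph edge labels: soft-plus, soft-minus, hard-plus, hard-minus
# (written [+], [-], [=], [≠] in the notes above).
SOFT_PLUS, SOFT_MINUS, HARD_PLUS, HARD_MINUS = "+", "-", "=", "!="


def harden_and_cluster(nodes, labels, seed=None):
    """Simplified sketch of the hardening pass over a clustering graph.

    nodes  : list of entity references
    labels : dict {frozenset({u, v}): label} over pairs of nodes
    Returns the clustering, i.e., the connected components of hard-plus edges.
    """
    comp = {n: n for n in nodes}        # node -> component id
    members = {n: {n} for n in nodes}   # component id -> its nodes
    apart = set()                       # component-id pairs that must stay separate

    def merge(u, v):
        cu, cv = comp[u], comp[v]
        if cu == cv:
            return
        if len(members[cu]) < len(members[cv]):
            cu, cv = cv, cu             # absorb the smaller component cv into cu
        for pair in [p for p in apart if cv in p]:
            apart.discard(pair)         # re-point "stay apart" constraints at cu
            other = next(iter(pair - {cv}), cu)
            apart.add(frozenset({cu, other}))
        for n in members.pop(cv):
            comp[n] = cu
            members[cu].add(n)

    def keep_apart(u, v):
        apart.add(frozenset({comp[u], comp[v]}))

    # 1. Apply the hard edges produced by hard rules first (assumed conflict-free
    #    here; the real system detects conflicts and reports them to the user).
    for edge, lab in labels.items():
        u, v = tuple(edge)
        if lab == HARD_PLUS:
            merge(u, v)
        elif lab == HARD_MINUS:
            keep_apart(u, v)

    # 2. A uniformly random node permutation induces an order on the soft edges.
    rng = random.Random(seed)
    rank = {n: r for r, n in enumerate(rng.sample(nodes, len(nodes)))}
    soft_edges = [e for e, lab in labels.items() if lab in (SOFT_PLUS, SOFT_MINUS)]
    soft_edges.sort(key=lambda e: sorted(rank[n] for n in e))

    # 3. Harden each soft edge in order, unless earlier decisions already force it.
    for edge in soft_edges:
        u, v = tuple(edge)
        if comp[u] == comp[v]:
            continue                                    # forced to hard-plus
        if frozenset({comp[u], comp[v]}) in apart:
            continue                                    # forced to hard-minus
        if labels[edge] == SOFT_PLUS:
            merge(u, v)                                 # harden [+] into [=]
        else:
            keep_apart(u, v)                            # harden [-] into hard-minus

    # 4. The clustering is the set of connected components of hard-plus edges.
    return list(members.values())


# Toy usage (names hypothetical): p1/p2 look like duplicates, p2/p3 must differ.
nodes = ["p1", "p2", "p3"]
labels = {
    frozenset({"p1", "p2"}): SOFT_PLUS,
    frozenset({"p2", "p3"}): HARD_MINUS,
}
print(harden_and_cluster(nodes, labels, seed=0))   # e.g. [{'p1', 'p2'}, {'p3'}]
```

The optimizations described under "Physical Implementation and Optimization" (implicit edge representation, processing [+] edges first, and the sort optimization) are not modeled in this sketch.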