AN ADAPTIVE AND EFFICIENT ALGORITHM FOR DETECTING
APPROXIMATELY DUPLICATE DATABASE RECORDS
Alvaro E. Monge1
California State University, Long Beach, CECS Department, Long Beach, CA, 90840-8302
June 9, 2000
Abstract: The integration of information is an important area of research in databases. By combining multiple information sources, a more complete and more accurate view of the world is attained, and additional knowledge gained. This is a non-trivial task, however. Often there are many sources which contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Furthermore, one source may not have the exact information that another source contains. Some of the information may differ, due to data entry errors for example, or may be missing altogether. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is typically manipulated into a form which is useful for other tasks, such as data mining. This paper addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit distance is used as a domain-independent method to recognize pairs of approximate duplicates. Second, the union-find data structure is used to maintain the clusters of duplicate records incrementally, as pairwise duplicate relationships are discovered. Third, the algorithm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This results in significant savings in the number of times that a pairwise record matching algorithm is applied, without impairing accuracy. Comprehensive experiments on synthetic databases and on a real-world database confirm the effectiveness of all three ideas.
Key words: merge/purge, data cleansing, approximate, duplicate, database, transitive closure, union-find, Smith-Waterman, edit distance
1. INTRODUCTION
Research in the areas of knowledge discovery and data cleansing has seen recent growth. The
growth is due to a number of reasons. The most obvious one is the exponential growth of information available online. In particular, the biggest impact comes from the popularity of the Internet
and the World Wide Web (WWW or web). In addition to the web, there are many traditional
sources of information, like relational databases, which have contributed to the growth of the information available online. The availability of these sources increases not only the amount of data, but also the variety of forms in which the data appears and the variability of its quality. These factors create a number of
problems. The work in this paper concentrates on one such problem: the detection of multiple
representations of a single entity.
Data cleansing is the process of cleaning up databases containing inaccurate or inconsistent data. One inconsistency is the existence of multiple different representations of the same real-world entity. The task is to detect such duplication and reconcile the differences into a single representation. The differences may be due to data entry errors such as typographical mistakes, to unstandardized abbreviations, or to differences in detailed schemas of records from multiple databases, among other reasons. As information from multiple sources is integrated, the same real-world entity ends up duplicated. The detection of records that are approximate duplicates, but not exact duplicates,
in databases is an important task. Without a solution to this problem many of the data mining
algorithms would be rendered useless as they depend on the quality of data being mined. This
paper presents solutions to this problem.
Every duplicate detection method proposed to date, including ours, requires an algorithm for detecting "is a duplicate of" relationships between pairs of records. Section 2 summarizes an algorithm used to determine if two records represent the same entity. Such record matching algorithms
are used in database-level duplicate detection algorithms which are presented in Section 3. The
section starts out by defining the problem and identifying related work in this area. Typically the record matching algorithms are relatively expensive computationally, so the database-level duplicate detection algorithms use grouping methods to reduce the number of times that record matching must be applied. This is the major contribution of the work presented in this article and is presented in Sections 3.5 and 3.6. Section 4 provides an empirical evaluation of the duplicate detection algorithms, including a comparison with previous work. The article concludes in Section 6 with final remarks about this work.
2. ALGORITHMS TO MATCH RECORDS
Many knowledge discovery and database mining applications need to combine information
from heterogeneous sources. These information sources, such as relational databases or worldwide
web pages, provide information about the same real-world entities, but describe these entities
differently. Resolving discrepancies in how entities are described is the problem addressed in this section. Specifically, the record matching problem is to determine whether or not two syntactically different record values describe the same semantic entity, i.e. real-world object.
Solving the record matching problem is vital in three major knowledge discovery tasks.
First, the ability to perform record matching allows one to identify corresponding information
in different information sources. This allows one to navigate from one source to another, and to combine information from the sources. In relational databases, navigating from one relation to another is called a "join." Record matching allows one to do joins on information
sources that are not relations in the strict sense. A worldwide web knowledge discovery
application that uses record matching to join separate Internet information sources, called
WebFind, is described in [31, 33, 34].
Second, the ability to do record matching allows one to detect duplicate records, whether in
one database or in multiple related databases. Duplicate detection is the central issue in the
so-called "Merge/Purge" task [18, 21, 35], which is to identify and combine multiple records, from one database or many, that concern the same entity but are distinct because of data entry errors. This task is also called "data scrubbing" or "data cleaning" or "data cleansing"
[41]. The detection problem is the focus of this article and is studied in more detail in
Section 3. This article does not propose solutions to the question of what is to be done once
the duplicate records are detected.
Third, doing record matching is one way to solve the database schema matching problem
[3, 25, 29, 44, 30]. This problem is to infer which attributes in two different databases
(i.e. which columns of which relations for relational databases) denote the same real-world
properties or objects. If several values of one attribute can be matched pairwise with values
of another attribute, then one can infer inductively that the two attributes correspond. This
technique is used to do schema matching for Internet information sources by the "information learning agent" (ILA) of [13], for example.
The remainder of this section discusses related work in the area of record matching, first stating the record matching problem precisely. Finally, the section briefly summarizes the domain-independent record matching algorithms proposed in [32].
2.1. Defining the problem
The record matching problem has been recognized as important for at least 50 years. Since the
1950s over 100 papers have studied matching for medical records under the name "record linkage." These papers are concerned with identifying medical records for the same individual in different
databases, for the purpose of performing epidemiological studies [37]. Record matching has also
been recognized as important in business for decades. For example tax agencies must do record
matching to correlate different pieces of information about the same taxpayer when social security
numbers are missing or incorrect. The earliest paper on duplicate detection in a business database
is by [48]. The "record linkage" problem in business has been the focus of workshops sponsored by
the US Census Bureau [24, 8, 11, 46, 47]. Record matching is also useful for detecting fraud and
money laundering [40].
Almost all published previous work on record matching is for specific application domains, and hence gives domain-specific algorithms. For example, three papers discuss record matching for customer addresses [1], census records [42], or variant entries in a lexicon [22]. Other work on record matching is not domain-specific, but assumes that domain-specific knowledge will be supplied by a human for each application domain [45, 18].
One important area of research that is relevant to approximate record matching is approximate
string matching. String matching has been one of the most studied problems in computer science
[5, 26, 17, 15, 9, 12]. The main approach is based on edit distance [28]. Edit distance is the minimum
number of operations on individual characters (e.g. substitutions, insertions, and deletions) needed
to transform one string of symbols to another [39, 17, 27]. In the survey by [17], the authors consider two different problems, one under the definition of equivalence and a second using similarity. Their definition of equivalence allows only small differences in the two strings. For example, they allow alternate spellings of the same word, and ignore the case of letters. The similarity problem allows for more errors, such as those due to typing: transposed letters, missing letters, etc. The equivalence of strings is the same as the mathematical notion of equivalence: it always respects the reflexivity, symmetry, and transitivity properties. The similarity problem, on the other hand, is the more difficult problem, where any typing and spelling errors are allowed. The similarity relation then is not necessarily transitive, while it still respects the reflexivity and symmetry properties.
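To make the edit-distance notion concrete, the following sketch computes the classic unweighted (Levenshtein) edit distance by dynamic programming. It is an illustration only, not code from the original work, and it assumes unit costs for substitutions, insertions, and deletions; the algorithms discussed later use weighted costs instead.

    def edit_distance(a, b):
        # d[i][j] is the minimum number of single-character substitutions,
        # insertions, and deletions needed to turn a[:i] into b[:j].
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i                      # delete all of a[:i]
        for j in range(len(b) + 1):
            d[0][j] = j                      # insert all of b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution (or match)
        return d[len(a)][len(b)]

    # Example: a transposed pair of letters counts as two unit edits here.
    print(edit_distance("recieve", "receive"))   # 2

Under this cost model the distance is symmetric and satisfies the triangle inequality, but, as noted above, similarity judgments built on top of it need not be transitive.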
2.2. Proposed algorithm
The word record is used to mean a syntactic designator of some real-world object, such as a
tuple in a relational database. The record matching problem arises whenever records that are not
identical, in a bit-by-bit sense, may still refer to the same object. For example, one database may
store the first name and last name of a person (e.g. "Jane Doe"), while another database may store only the initials and the last name of the person (e.g. "J. B. Doe").
In this work, we say that two records are equivalent if they are equal semantically, that is if
they both designate the same real-world entity. Semantically, this problem respects the reflexivity,
symmetry, and transitivity properties. The record matching algorithms which solve this problem
depend on the syntax of the records. These syntactic calculations are approximations of what we
really want, semantic equivalence. In such calculations, errors are bound to occur and thus the
semantic equivalence will not be properly calculated. However, the claim is that there are few
errors and that the approximation is good. The experiments from Section 4 will provide evidence
for this claim.
Equivalence may sometimes be a question of degree, so a function solving the record matching
problem returns a value between 0.0 and 1.0, where 1.0 means certain equivalence and 0.0 means
certain non-equivalence. This study assumes that these scores are ordinal, but not that they have
any particular scalar meaning. Degree of match scores are not necessarily probabilities or fuzzy
degrees of truth. An application will typically just compare scores to a threshold that depends on
the domain and the particular record matching algorithm in use.
Record matching algorithms vary by the amount of domain-specific knowledge that they use. The pairwise record matching algorithms used in most previous work have been application-specific. For example, in [18], the authors use production rules based on domain-specific knowledge, which are first written in OPS5 [7], a programming language for rule-based production systems used primarily in artificial intelligence, and then translated by hand into C. This section presents algorithms for pairwise record matching which are relatively domain-independent. In particular, this work proposes to use a generalized edit-distance algorithm. This domain-independent algorithm is a variant of the well-known Smith-Waterman algorithm [43], which was originally developed for finding evolutionary relationships between biological protein or DNA sequences.
A record matching algorithm is domain-independent if it can be used without any modifications in a range of applications. By this definition, the Smith-Waterman algorithm is domain-independent under the assumptions that records have similar schemas and that records are made up of alphanumeric characters. The first assumption is needed because the Smith-Waterman algorithm does not address the problem of duplicate records containing fields which are transposed.† The second assumption is needed because any edit-distance algorithm assumes that records are strings over some fixed alphabet of symbols. Naturally this assumption is true for a wide range of databases, including those with numerical fields such as social security numbers that are represented in decimal notation.
2.3. The Smith-Waterman algorithm
Given two strings of characters, the Smith-Waterman algorithm [43] uses dynamic programming
to find the lowest-cost series of changes that converts one string into the other, i.e. the minimum "edit distance" weighted by cost between the strings. Costs for individual changes, which are mutations, insertions, or deletions, are parameters of the algorithm. Although edit-distance algorithms have been used for spelling correction and other text applications before, this work is the first to show how to use an edit-distance method effectively for general textual record matching.
For matching textual records, we define the alphabet to be the lower case and upper case alphabetic characters, the ten digits, and three punctuation symbols: space, comma, and period.
All other characters are removed before applying the algorithm. This particular choice of alphabet
is not critical.
The Smith-Waterman algorithm has three parameters m, s, and c. Given the alphabet Σ, m is a |Σ| × |Σ| matrix of match scores for each pair of symbols in the alphabet. The matrix m has
entries for exact matches, for approximate matches, as well as for non-matches of two symbols in
the alphabet. In the original Smith-Waterman algorithm, this matrix models the mutations that
occur in nature. In this work, the matrix tries to account for typical phoneme and typing errors
that occur when a record is entered into a database.
Much of the power of the Smith-Waterman algorithm is due to its ability to introduce gaps in
the records. A gap is a sequence of non-matching symbols; these are seen as dashes in the example
alignments of Figure 1. The Smith-Waterman algorithm has two parameters which affect the start and length of the gaps. The scalar s is the cost of starting a gap in an alignment, while c is the cost of continuing a gap. The ratios of these parameters strongly affect the behavior of the algorithm.
For example if the gap penalties are such that it is relatively inexpensive to continue a gap (c < s)
then the Smith-Waterman algorithm prefers a single long gap over many short gaps. Intuitively,
since the Smith-Waterman algorithm allows for gaps of unmatched characters, it should cope well
with many abbreviations. It should also perform well when records have small pieces of missing
information or minor syntactical dierences, including typographical mistakes.
The Smith-Waterman algorithm works by computing a score matrix E. One of the strings
is placed along the horizontal axis of the matrix, while the second string goes along the vertical
axis. An entry E(i, j) in this matrix is the best possible matching score between the prefix 1...i of one string and the prefix 1...j of the second string. When the prefixes (or the entire strings) match exactly, then the optimal alignment can be found along the main diagonal. For approximate matches, the optimal alignment is within a small distance of the diagonal. Formally, the value of E(i, j) is the maximum of:

    E(i-1, j-1) + m(letter(i), letter(j)),
    E(i-1, j) + c    if align(i-1, j-1) ends in a gap,
    E(i-1, j) + s    if align(i-1, j-1) ends in a match,
    E(i, j-1) + c    if align(i-1, j-1) ends in a gap,
    E(i, j-1) + s    if align(i-1, j-1) ends in a match.
All experiments reported in this paper use the same Smith-Waterman algorithm with the same
gap penalties and match matrix. The parameter values were determined using a small set of affiliation records. The experiments showed that the values chosen were intuitively reasonable and provided good results. The match score matrix is symmetric, with all entries -3 except that an exact match scores 5 (regardless of case) and approximate matches score 3. An approximate match occurs between two characters if they are both in one of the sets {d t}, {g j}, {l r}, {m n}, {b p v}, {a e i o u}, {, .}.

† Technically, a variant of the Needleman-Wunsch [36] algorithm is actually used, which calculates the minimum weighted edit-distance between two entire strings. Given two strings, the better-known Smith-Waterman algorithm finds a substring in each string such that the pair of substrings has minimum weighted edit-distance.

    department- of chemical engineering, stanford university, ca------lifornia
    Dep------t.  of Chem---.  Eng-------., Stanford Univ-----., CA, USA.

    psychology department, stanford univ-----------ersity, palo alto, calif
    Dept.  of Psychol-------------., Stanford Univ., CA, USA.

Fig. 1: Optimal record alignments produced by the Smith-Waterman algorithm.
The penalties for starting and continuing a gap are 5 and 1 respectively. The informal experiments just mentioned show that the penalty to start a gap should be similar in absolute magnitude
to the score of an exact match between two letters, while the penalty to continue a gap should be
smaller than the score of an approximate match. If these conditions are met, the accuracy of the
Smith-Waterman algorithm is nearly unaffected by the precise values of the gap penalties. The experiments varied the penalty for starting gaps by considering values smaller and greater than an exact match. Similarly, the penalty to continue a gap was varied by considering values greater than 0.0. The final score calculated by the algorithm is normalized to range between 0.0 and 1.0
by dividing by 5 times the length of the smaller of the two records being compared.
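As an illustration of the scoring scheme just described, the sketch below implements a global alignment with affine gap penalties using the parameter values reported in this section (exact match 5, approximate match 3, mismatch -3, gap start 5, gap continue 1, normalization by 5 times the length of the shorter record). It is not the author's implementation: the Gotoh-style bookkeeping of whether an alignment ends in a gap is one standard way to realize the recurrence above, and the function and variable names are ours. Strings are assumed to have been restricted to the alphabet described earlier.

    NEG = float("-inf")

    APPROX_SETS = [set("dt"), set("gj"), set("lr"), set("mn"), set("bpv"),
                   set("aeiou"), set(",.")]

    def char_score(x, y):
        # Match scores from Section 2.3: 5 for an exact match (case-insensitive),
        # 3 for an approximate match, -3 otherwise.
        x, y = x.lower(), y.lower()
        if x == y:
            return 5
        if any(x in s and y in s for s in APPROX_SETS):
            return 3
        return -3

    def sw_score(a, b, gap_start=5, gap_cont=1):
        # Affine-gap global alignment (Gotoh-style): M ends in an aligned pair of
        # characters, X ends in a gap in b, Y ends in a gap in a.
        n, m = len(a), len(b)
        if n == 0 or m == 0:
            return 0.0
        M = [[NEG] * (m + 1) for _ in range(n + 1)]
        X = [[NEG] * (m + 1) for _ in range(n + 1)]
        Y = [[NEG] * (m + 1) for _ in range(n + 1)]
        M[0][0] = 0
        for i in range(1, n + 1):
            X[i][0] = -gap_start - gap_cont * (i - 1)
        for j in range(1, m + 1):
            Y[0][j] = -gap_start - gap_cont * (j - 1)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = char_score(a[i - 1], b[j - 1])
                M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
                X[i][j] = max(M[i-1][j] - gap_start, X[i-1][j] - gap_cont,
                              Y[i-1][j] - gap_start)
                Y[i][j] = max(M[i][j-1] - gap_start, Y[i][j-1] - gap_cont,
                              X[i][j-1] - gap_start)
        best = max(M[n][m], X[n][m], Y[n][m])
        # Normalize by 5 times the length of the shorter record (Section 2.3).
        return max(0.0, best / (5.0 * min(n, m)))

A score such as sw_score("Stanford Univ., CA", "Stanford University, California") can then be compared against the matching threshold discussed in Section 4.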
Figure 1 shows two typical optimal alignments produced by the Smith-Waterman algorithm
with the choice of parameter values described. The records shown are taken from datasets used
in experiments for measuring the accuracy of the Smith-Waterman algorithm and other record
matching algorithms [32]. These examples show that with the chosen values for the gap penalties,
the algorithm detects abbreviations by introducing gaps where appropriate. The second pair of
records also shows the inability of the Smith-Waterman algorithm to match out-of-order subrecords.
The Smith-Waterman algorithm uses dynamic programming and its running time is proportional to the product of the lengths of its input strings. This quadratic time complexity is similar
to that of other more basic record matching algorithms [32]. The Smith-Waterman algorithm is
symmetric: the score of matching record A to B is the same as the score of matching B to A.
Symmetry may be a natural requirement for some applications of record matching but not for
others. For example, the name "Alvaro E. Monge" matches "A. E. Monge" while the reverse is not necessarily true.
3. ALGORITHMS TO DETECT DUPLICATE DATABASE RECORDS
This section considers the problem of detecting when records in a database are duplicates of each
other, even if they are not textually identical. If these multiple duplicate records concern the same
real-world entity, they must be detected in order to have a consistent database. Multiple records
for a single entity may exist because of typographical data entry errors, because of unstandardized
abbreviations, or because of differences in detailed schemas of records from multiple databases,
among other reasons. Thus, the problem is one of consolidating the records in these databases, so
that an entity is represented by a single record. This is a necessary and crucial preprocessing step in
data warehousing and data mining applications where data is collected from many dierent sources
and inconsistencies can lead to erroneous results. Before performing data analysis operations, the
data must be preprocessed and organized into a consistent form.
3.1. Related work
The duplicate detection problem is different from, but related to, the schema matching problem [3, 25, 29, 44]. That problem is to find the correspondence between the structure of records in one database and the structure of records in a different database. The problem of actually detecting
matching records still exists even when the schema matching problem has been solved. For example,
consider records from different databases that include personal names. The fact that there are personal name attributes in each record is detected by schema matching. However record-level approximate duplicate detection is still needed in order to combine different records concerning
the same person.
Record-level duplicate detection, or record matching, may be needed because of typographical
errors, or varying abbreviations in related records. Record matching may also be used as a substitute for detailed schema matching, which may be impossible for semi-structured data. For example
records often differ in the detailed format of personal names or addresses. Even if records follow a fixed high-level schema, some of their fields may not follow a fixed low-level schema, i.e. the division of fields into subfields may not be standardized.
In general, we are interested in situations where several records may refer to the same real-world
entity, while not being syntactically equivalent. A set of records that refer to the same entity can
be interpreted in two ways. One way is to view one of the records as correct and the other records
as duplicates containing erroneous information. The task then is to cleanse the database of the
duplicate records [41, 18]. Another interpretation is to consider each matching record as a partial
source of information. The aim is then to merge the duplicate records, yielding one record with
more complete information [21].
3.2. The standard method and its improvements
The standard method of detecting exact duplicates in a table is to sort the table and then
to check if neighboring tuples are identical. Exact duplicates are guaranteed to be next to each
other in the sorted order regardless of which part of a record the sort is performed on. There
are a number of optimizations of this approach and these are described in [4]. The approach can be extended to detect approximate duplicates. The idea is to do sorting to achieve preliminary clustering, and then to do pairwise comparisons of nearby records [38, 14, 16]. In this case, there are no guarantees as to where duplicates are located relative to each other in the sorted order. In a good scenario, the approximate duplicate records may not be found next to each other but will be found nearby. In the worst case, they will be found at opposite extremes of the sorted order. The result depends on the field used to sort and on the probability of error in that field. Thus, sorting is typically based on an application-specific key chosen to make duplicate records likely to
appear near each other.
In [18], the authors compare nearby records by sliding a window of fixed size over the sorted database. If the window has size W, then record i is compared with records i-W+1 through i-1 if i >= W, and with records 1 through i-1 otherwise. The number of comparisons performed is O(TW), where T is the total number of records in the database.
In order to improve accuracy, the results of several passes of duplicate detection can be combined
[38, 24]. Typically, combining the results of several passes over the database with small window
sizes yields better accuracy for the same cost than one pass over the database with a large window
size.
One way to combine the results of multiple passes is by explicitly computing the transitive
closure of all discovered pairwise "is a duplicate of" relationships [18]. If record R1 is a duplicate of record R2, and record R2 is a duplicate of record R3, then by transitivity R1 is a duplicate of record R3. Transitivity is true by definition if duplicate records concern the same real-world entity, but in practice there will always be errors in computing pairwise "is a duplicate of" relationships, and transitivity will propagate these errors. However, in typical databases, sets of duplicate records tend to be distributed sparsely over the space of possible records, and the propagation of errors is rare. The experimental results in [18, 19] and in Section 4 of this paper confirm this claim.
Hylton uses a different, more expensive, method to do a preliminary grouping of records [21]. Each record is considered separately as a "source record" and used to query the remaining records
in order to create a group of potentially matching records. Then each record in the group is
compared with the source record using his pairwise matching procedure.
Finally, similarity of entire documents is also related to this body of work. In [6], the authors
provide a method for determining document similarity and use it to build a clustering of syntactically similar documents. It would be expensive to try to compare documents in their entirety.
Thus, the authors calculate a sketch of each document, where the size of a sketch is on the order of hundreds of bytes. The sketch is based on the unique contiguous subsequences of words contained in the document, called shingles by the authors. The authors show that document similarity is not compromised if the sketches of documents are compared instead of the entire documents. To compute the clusters of similar documents, the authors must first calculate the number of shingles shared between documents. The more shingles two documents have in common, the more similar they are. When two documents share enough shingles to be deemed similar, they are put in the same cluster. To maintain the clusters the authors also use the union-find data structure. However, before any cluster gets created, all the document comparisons have been performed. As we will see later, in the algorithm presented in this paper, the union-find data structure is used more efficiently, allowing many record comparisons to be avoided. In addition, the system also queries the union-find data structure to improve accuracy by performing some additional record comparisons.
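To illustrate the shingle-based notion of similarity described above (this is not code from [6], whose sketches are small fixed-size samples of the shingle sets), a minimal version might look like the following; the four-word shingle size and the Jaccard-style resemblance measure are illustrative assumptions.

    def shingles(text, w=4):
        # The set of w-word contiguous subsequences ("shingles") of a document.
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(max(0, len(words) - w + 1))}

    def resemblance(doc_a, doc_b, w=4):
        # Fraction of shingles shared between two documents; [6] estimates this
        # from small fixed-size sketches rather than the full shingle sets.
        a, b = shingles(doc_a, w), shingles(doc_b, w)
        if not a or not b:
            return 0.0
        return len(a & b) / len(a | b)

Documents whose resemblance exceeds a chosen threshold would then be placed in the same cluster, for example via the union-find structure discussed in the next sections.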
3.3. Transitivity and the duplicate detection problem
Under the assumption of transitivity, the problem of detecting duplicates in a database can
be described in terms of keeping track of the connected components of an undirected graph. Let
the vertices of a graph G represent the records in a database of size T. Initially, G will contain T
unconnected vertices, one for each record in the database. There is an undirected edge between
two vertices if and only if the records corresponding to the pair of vertices are found to match,
according to the pairwise record matching algorithm. When considering whether to apply the
expensive pairwise record matching algorithm to two records, we can query the graph G. If both
records are in the same connected component, then it has been determined previously that they are
approximate duplicates, and the comparison is not needed. If they belong to different components,
then it is not known whether they match or not. If comparing the two records results in a match,
their respective components should be combined to create a single new component. This is done
by inserting an edge between the vertices that correspond to the records compared.
At any time, the connected components of the graph G correspond to the transitive closure
of the "is a duplicate of" relationships discovered so far. Consider three records Ru, Rv, and Rw and their corresponding nodes u, v, and w. When the fact that Ru is a duplicate of record Rv is detected, an edge is inserted between the nodes u and v, thus putting both nodes in the same connected component. Similarly, when the fact that Rv is a duplicate of Rw is detected, an edge is inserted between nodes v and w. Transitivity of the "is a duplicate of" relation is equivalent to reachability in the graph. Since w is reachable from u (and vice versa), the corresponding records Ru and Rw are duplicates. This "is a duplicate of" relationship is detected automatically
by maintaining the graph G, without comparing Ru and Rw .
3.4. The Union-Find data structure
There is a well-known data structure that efficiently solves the problem of incrementally maintaining the connected components of an undirected graph, called the union-find data structure [20, 10]. This data structure keeps a collection of disjoint updatable sets, where each set is identified by a representative member of the set. Each set corresponds to a connected component of the graph. The data structure has two operations:
Union(x, y) combines the sets that contain node x and node y, say Sx and Sy, into a new set that is their union Sx ∪ Sy. A representative for the union is chosen, and the new set replaces Sx and Sy in the collection of disjoint sets.
Find(x) returns the representative of the unique set containing x. If Find(x) is invoked twice
without modifying the set between the requests, the answer is the same.
To find the connected components of a graph G, we first create |G| singleton sets, each containing a single node from G. For each edge (u, v) ∈ E(G), if Find(u) ≠ Find(v) then we perform Union(u, v). At any time, two nodes u and v are in the same connected component if and only if their sets have the same representative, that is, if and only if Find(u) = Find(v). Note that the problem of incrementally computing the connected components of a graph is harder than just finding the connected components. There are linear time algorithms for finding the connected components of a graph. However, here we require the union-find data structure because we need to find the connected components incrementally as duplicate records are detected.
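For concreteness, a minimal union-find implementation is sketched below. It is the textbook structure of [10, 20] with path compression and union by size, not code from the paper, and the class and method names are ours.

    class UnionFind:
        def __init__(self, elements):
            # One singleton set per element (here, one per database record).
            self.parent = {x: x for x in elements}
            self.size = {x: 1 for x in elements}

        def find(self, x):
            # Follow parent pointers to the set representative, compressing the path.
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, x, y):
            # Merge the sets containing x and y (union by size).
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return
            if self.size[rx] < self.size[ry]:
                rx, ry = ry, rx
            self.parent[ry] = rx
            self.size[rx] += self.size[ry]

For example:

    uf = UnionFind(["R1", "R2", "R3"])
    uf.union("R1", "R2")
    print(uf.find("R1") == uf.find("R2"))   # True: same connected component
    print(uf.find("R1") == uf.find("R3"))   # False: different components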
3.5. Further improvements on the standard algorithm
The previous section described a way in which to maintain the clusters of duplicate records and
compute the transitive closure of "is a duplicate of" relationships incrementally. This section uses the union-find data structure to improve the standard method for detecting approximate duplicate
records.
As done by other algorithms, the algorithm performs multiple passes of sorting and scanning.
Whereas previous algorithms sort the records in each pass according to domain-specific criteria, this work proposes to use domain-independent sorting criteria. Specifically, the algorithm uses two passes. The first pass treats each record as one long string and sorts these lexicographically,
reading from left to right. The second pass does the same reading from right to left.
After sorting, the algorithm scans the database with a fixed size window. Initially, the union-find data structure (i.e. the collection of dynamic sets) contains one set per record in the database.
The window slides through the records in the sorted database one record at a time (i.e. windows
overlap). In the standard window method, the new record that enters the window is compared
with all other records in the window. The same is done in this algorithm, with the exception that
some of these comparisons are unnecessary. A comparison is not performed if the two records
are already in the same cluster. This can be easily determined by querying the union-find data structure. When considering the new record Rj in the window and some record Ri already in the window, first the algorithm tests whether they are in the same cluster. This involves comparing their respective cluster representatives, that is, comparing the value of Find(Rj) and Find(Ri). If both these values are the same, then no comparison is needed because both records belong to the same cluster or connected component. Otherwise the two records are compared. When the comparison is successful, a new "is a duplicate of" relationship is established. To reflect this in the union-find data structure, the algorithm combines the clusters corresponding to Rj and Ri by making the function call Union(Rj, Ri).
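A sketch of this improved scanning pass is given below, reusing the UnionFind class sketched in Section 3.4. The pairwise matcher is left abstract: matches(ri, rj) is a hypothetical predicate, for example the Smith-Waterman score of Section 2.3 compared against a threshold. Records are assumed to be distinct strings, and details such as duplicate keys and sort tie-breaking are simplified.

    def window_pass(sorted_records, window, matches, uf):
        # One scanning pass over an already-sorted list of records.  Each record is
        # compared with the previous window-1 records, except that the expensive
        # comparison is skipped when union-find already puts the pair in one cluster.
        for j, rj in enumerate(sorted_records):
            for ri in sorted_records[max(0, j - window + 1):j]:
                if uf.find(ri) != uf.find(rj):     # cheap cluster-membership test
                    if matches(ri, rj):            # expensive pairwise comparison
                        uf.union(ri, rj)

    def detect_duplicates(records, window, matches):
        uf = UnionFind(records)
        # Two domain-independent passes: records sorted as long strings read
        # left-to-right, then sorted by their reversed strings.
        window_pass(sorted(records), window, matches, uf)
        window_pass(sorted(records, key=lambda r: r[::-1]), window, matches, uf)
        return uf   # its disjoint sets are the detected duplicate clusters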
Section 4 has the results of experiments comparing this improved algorithm to the standard
method. We expect that the improved algorithm will perform fewer comparisons. Fewer comparisons usually translates to decreased accuracy. However, similar accuracy is expected because
the comparisons which are not performed correspond to records which are already members of a
cluster, most likely due to the transitive closure of the "is a duplicate of" relationships. In fact, all experiments show that the improved algorithm is as accurate as the standard method while performing significantly fewer record comparisons.
3.6. The overall priority queue algorithm
The algorithm described in the previous section has the weakness that the window used for
scanning the database records is of fixed size. If a cluster in the database has more duplicate
records than the size of the window, then it is possible that some of these duplicates will not be
detected because not enough comparisons are being made. Furthermore if a cluster has very few
duplicates or none at all, then it is possible that comparisons are being done which may not be
needed. An algorithm is needed which responds adaptively to the size and homogeneity of the
clusters discovered as the database is scanned. This section describes such a strategy. This is the
high-level strategy adopted in the duplicate detection algorithm proposed in this work.
Before describing the algorithm, we need to analyze the fixed size window method. The fixed size window algorithm effectively saves the last |W| - 1 records for possible comparisons with the
new record that enters the window as it slides by one record. The key observation to make is that
in most cases, it is unnecessary to save all these records. The evidence of this is that sorting has
already placed approximate duplicate records near each other. Thus, most of the |W| - 1 records in the window already belong to the same cluster. The new record will either become a member of that cluster, if it is not already a member of it, or it will be a member of an entirely different cluster.
In either case, exactly one comparison per cluster represented in the window is needed. Since in
most cases, all the records in the window will belong to the same cluster, only one comparison will
be needed. Thus, instead of saving individual records in a window, the algorithm saves clusters.
This leads to the use of a priority queue, in place of a window, to save record clusters. The rest of
this section describes this strategy as it is embedded in the duplicate detection system.
First, like the algorithm described in the previous section, two passes of sorting and scanning
are performed. The algorithm scans the sorted database with a priority queue of record subsets
belonging to the last few clusters detected. The priority queue contains a fixed number of sets of records. In all the experiments reported below this number is 4. Each set contains one or more records from a detected cluster. For efficiency reasons, entire clusters should not always be saved since they may contain many records. On the other hand, a single record may be insufficient to
represent all the variability present in a cluster. Records of a cluster will be saved in the priority
queue only if they add to the variability of the cluster being represented. The set representing the
cluster with the most recently detected cluster member has highest priority in the queue, and so
on.
The algorithm scans through the sorted database sequentially. Suppose that record Rj is the
record currently being considered. The algorithm rst tests whether Rj is already known to be a
member of one of the clusters represented in the priority queue. This test is done by comparing
the cluster representative of Rj to the representative of each cluster present in the priority queue.
If one of these comparisons is successful, then Rj is already known to be a member of the cluster
represented by the set in the priority queue. We move this set to the head of the priority queue
and continue with the next record, Rj +1.
Whatever their result, these comparisons are computationally inexpensive because they are
done just with Find operations. In the first pass, Find comparisons are guaranteed to fail since the algorithm scans the records in the sorted database sequentially and this is the first time each record is encountered. Therefore these tests are avoided in the first pass.
Next, in the case where Rj is not a known member of an existing priority queue cluster, the
algorithm uses the Smith-Waterman algorithm to compare Rj with records in the priority queue.
The algorithm iterates through each set in the priority queue, starting with the highest priority
set. For each set, the algorithm scans through the members Ri of the set. Rj is compared to Ri
using the Smith-Waterman algorithm. If a match is found, then Rj 's cluster is combined with Ri 's
cluster, using a Union(Ri, Rj) operation. In addition, Rj may also be included in the priority queue set that represents Ri's cluster, which now also represents the new combined cluster. Specifically, Rj is included if its Smith-Waterman matching score is below a certain "strong match" threshold. This priority queue cluster inclusion threshold is higher than the threshold for declaring a match, but lower than 1.0. Intuitively, if Rj is very similar to Ri, it is not necessary to include it in the
subset representing the cluster, but if Rj is only somewhat similar, i.e. its degree of match is below
the inclusion threshold, then including Rj in the subset will help in detecting future members of
the cluster.
On the other hand, if the Smith-Waterman comparison between Ri and Rj yields a very low
score, below a certain "bad miss" threshold, then the algorithm continues directly with the next set in the priority queue. The intuition here is that if Ri and Rj have no similarity at all, then comparisons of Rj with other members of the cluster containing Ri will likely also fail. If the comparison still fails but the score is close to the matching threshold, then it is worthwhile to compare Rj with the remaining members of the cluster. The "strong match" and "bad miss" thresholds are used to counter the errors which are propagated when computing pairwise "is a duplicate of" relationships.
Finally, if Rj is compared to members of each set in the priority queue without detecting that it is a duplicate of any of these, then Rj must be a member of a cluster not currently represented in the priority queue. In this case Rj is saved as a singleton set in the priority queue, with the highest priority. If this action causes the size of the priority queue to exceed its limit then the lowest priority set is removed from the priority queue.

Equational       Smith-Waterman   Soc. Sec.    Name                  Address                  City, State    Zip code
theory           score            number
----------------------------------------------------------------------------------------------------------------------
True positive    0.6851           missing      Colette Johnen        600 113th St. apt. 5a5   missing
                                  missing      John Colette          600 113th St. ap. 585    missing
False negative   0.4189           152014425    Bahadir T Bihsya      220 Jubin 8s3            Toledo OH      43619
                                  152014423    Bishya T ulik         318 Arpin St 1p2         Toledo OH      43619
False positive   0.3619           274158217    Frankie Y Gittler     PO Box 3628              Gresham, OR    97080
                                  267415817    Erlan W Giudici       PO Box 2664              Walton, OR     97490
False positive   0.1620           760652621    Arseneau N Brought    949 Corson Ave 515       Blanco NM      87412
                                  765625631    Bogner A Kuxhausen    212 Corson Road 0o3      Raton, NM      87740

Table 1: Example pairs of records and the status of the matching algorithms.
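To summarize the strategy of this section, the sketch below combines the priority queue of cluster subsets with the union-find structure; it reuses the UnionFind class and sw_score function sketched earlier. The matching threshold of roughly 0.5 comes from Section 4, but the strong-match and bad-miss values shown here are illustrative assumptions, as are all function names; this is a sketch of the described strategy, not the author's implementation.

    from collections import deque

    def pqs_pass(sorted_records, uf, match_t=0.5, strong_t=0.75, miss_t=0.25,
                 queue_size=4, first_pass=True):
        # Priority queue of cluster subsets: each entry is a list of records drawn
        # from one detected cluster; the most recently touched subset sits at the
        # front of the deque (highest priority).
        queue = deque()
        for rj in sorted_records:
            # 1. Cheap test: is rj already known to belong to a queued cluster?
            if not first_pass:
                hit = next((s for s in queue if uf.find(rj) == uf.find(s[0])), None)
                if hit is not None:
                    queue.remove(hit)
                    queue.appendleft(hit)          # move subset to highest priority
                    continue
            # 2. Expensive Smith-Waterman comparisons against the saved subsets.
            matched = False
            for subset in list(queue):
                for ri in subset:
                    score = sw_score(ri, rj)
                    if score >= match_t:           # rj joins ri's cluster
                        uf.union(ri, rj)
                        if score < strong_t:       # rj adds variability to the subset
                            subset.append(rj)
                        queue.remove(subset)
                        queue.appendleft(subset)
                        matched = True
                        break
                    if score <= miss_t:            # "bad miss": skip rest of cluster
                        break
                if matched:
                    break
            # 3. No match: rj starts a new singleton subset with highest priority.
            if not matched:
                queue.appendleft([rj])
                if len(queue) > queue_size:
                    queue.pop()                    # drop the lowest-priority subset

As in Section 3.5, the full algorithm would run this pass twice, once on the records sorted left-to-right and once sorted by their reversed strings, sharing one UnionFind instance whose disjoint sets form the detected clusters.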
4. EXPERIMENTAL RESULTS
The first experiments reported here use databases that are mailing lists generated randomly by software designed and implemented by [19]. Each record in a mailing list contains nine fields: social security number, first name, middle initial, last name, address, apartment, city, state, and zip code. All field values are chosen randomly and independently. Personal names are chosen from a list of 63000 real names. Address fields are chosen from lists of 50 state abbreviations, 18670 city
names, and 42115 zip codes.
Once the database generator creates a random record, it creates a random number of duplicate records according to a fixed probability distribution. When it creates a duplicate record, the generator introduces errors (i.e. noise) into the record. Possible errors range from small typographical slips to complete name and address changes. The generator introduces typographical errors according to frequencies known from previous research on spelling correction algorithms [27]. Edit-distance algorithms are designed to detect some of the errors introduced; however, our algorithm
was developed without knowledge of the particular error probabilities used by the database generator. The pairwise record matching algorithm of [18] has special rules for transpositions of entire
words, complete changes in names and zip codes, and social security number omissions, while our
Smith-Waterman algorithm variant does not.
Table 1 contains example pairs of records chosen as especially instructive by [19], with pairwise scores assigned by the Smith-Waterman algorithm. The first pair is correctly detected to be duplicates by the rules of [18]. The Smith-Waterman algorithm classifies it as a duplicate given any threshold below 0.68. The equational theory does not detect the second pair as duplicates. The Smith-Waterman algorithm performs correctly on this pair when the duplicate detection threshold is set at 0.41 or lower. Finally, the equational theory falsely finds the third and fourth pairs to be
duplicates. The Smith-Waterman algorithm performs correctly on these pairs with a threshold of
0.37 or higher. These examples suggest that we should choose a threshold around 0.40. However
this threshold is somewhat aggressive. Small further experiments show that a more conservative threshold of 0.50 detects most real duplications while keeping the number of false positives
negligible.
4.1. Measuring accuracy
The measure of accuracy used in this paper is based on the number of clusters detected that
are \pure". A cluster is pure if and only if it contains only records that belong to the same true
cluster of duplicates. This accuracy measure considers entire clusters, not individual records. This
is intuitive, since a cluster corresponds to a real world entity, while individual records do not. A
cluster detected by a duplicate detection algorithm can be classified as follows:
1. the cluster is equal to a true cluster, or
2. the cluster is a subset of a true cluster, or
3. the cluster contains records from two or more true clusters.
By this definition, a pure cluster falls in either of the first two cases above. Clusters that fall in the last case are referred to as "impure" clusters. A good detection algorithm will have 100% of the detected clusters pure and 0% impure.

[Figure 2: a plot of the number of clusters against the average number of duplicates per original record, with curves for the true clusters, PQS w/SW (pure clusters), PQS w/HS (pure clusters), merge/purge (pure and impure clusters), and the impure clusters.]
Fig. 2: Accuracy results for varying the number of duplicates per original record using a Zipf distribution.
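For the synthetic databases, where each generated record carries a known true-cluster label, this accuracy measure can be computed directly. The following sketch is ours, with an assumed true_label mapping from record to true cluster; it simply counts pure and impure detected clusters.

    def count_pure_clusters(detected_clusters, true_label):
        # A detected cluster is "pure" if all of its records share one true-cluster
        # label (cases 1 and 2 above); otherwise it is "impure" (case 3).
        pure = impure = 0
        for cluster in detected_clusters:
            labels = {true_label[r] for r in cluster}
            if len(labels) == 1:
                pure += 1
            else:
                impure += 1
        return pure, impure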
4.2. Algorithms tested
The sections that follow provide the results from experiments performed in this study. Several
algorithms are compared, where each is made up of different features. The main features that make up an algorithm are the pairwise record matching algorithm used and the structure used for storing records for possible comparisons. The three main algorithms compared are the so-called "Merge/Purge" algorithm [18, 19] and two versions of the algorithm from Section 3.6. One version, "PQS w/SW", uses the Smith-Waterman algorithm to match records, while the second version, "PQS w/HS", uses the equational theory described in [18]. For short, priority queue strategy is abbreviated by PQS.
Both PQS-based algorithms use the union-find data structure described in Section 3.4. Both
also use the priority queue of cluster subsets discussed in Section 3.6. The equational theory
matcher returns only either 0 or 1, so unlike the Smith-Waterman algorithm, it does not estimate
degrees of pairwise matching. Thus, we need to modify the strategy of keeping in the priority
queue a set of representative records for a cluster. In the current implementation of the PQS
w/HS algorithm, only the most recently detected member of each cluster is kept.
For both PQS-based algorithms, the figures show the number of pure and impure clusters that were detected. The figures also show the number of true clusters in the database, and the number of clusters detected by merge-purge. Unfortunately the merge-purge software does not distinguish between pure and impure clusters. The accuracy results reported here for merge-purge therefore include both pure and impure clusters, thus slightly overstating its accuracy.
4.3. Varying number of duplicates per record
A good duplicate detection algorithm should be almost unaffected by changes in the number of duplicates that each record has. To study the effect of increasing this number, we varied the
number of duplicates per record using a Zipf distribution.
Zipf distributions give high probability to small numbers of duplicates, but still give non-trivial
probability to large numbers of duplicates. A Zipf distribution has two parameters, 0 ≤ θ ≤ 1 and D ≥ 1. For 1 ≤ i ≤ D the probability of i duplicates is c·i^(θ-1), where the normalization constant is c = 1 / Σ_{i=1..D} i^(θ-1). Having a maximum number of duplicates D is necessary because the sum Σ_{i=1..∞} i^(θ-1) diverges if θ ≥ 0.

[Figure 3: a log-log plot of the number of clusters against the total number of records in the database, with curves for the true clusters, U/F + |W|=10 w/SW (impure clusters), U/F + PQS w/SW (impure clusters), and U/F + PQS w/HS (impure clusters).]
Fig. 3: Accuracy results for varying database sizes (log-log plot).
Four databases were created, each with a different value of the parameter θ from the set {0.1, 0.2, 0.4, 0.8}. The maximum number of duplicates per original record was kept constant at
20. The noise level was also maintained constant. The sizes of the databases ranged from 301153
to 480239 total records.
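For concreteness, a database generator could draw the number of duplicates for each original record from this truncated Zipf distribution roughly as follows. This is an illustrative sketch, not the generator of [19].

    import random

    def zipf_duplicate_count(theta, D):
        # P(i duplicates) = c * i**(theta - 1) for 1 <= i <= D, with c chosen so
        # the probabilities sum to one; truncating at D keeps the sum finite.
        weights = [i ** (theta - 1) for i in range(1, D + 1)]
        c = 1.0 / sum(weights)
        u, acc = random.random(), 0.0
        for i, w in enumerate(weights, start=1):
            acc += c * w
            if u <= acc:
                return i
        return D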
In all experiments, the Merge/Purge engine was run with a fixed window of size 10, as in most
experiments performed by [18]. Our duplicate detection algorithm used a priority queue containing
at most 4 sets of records. This number was chosen to make the accuracy of both algorithms
approximately the same. Of course, it is easy to run our algorithm with a larger priority queue in
order to obtain greater accuracy.
Figure 2 shows that our algorithm performs slightly better than the Merge/Purge engine. The
number of pure clusters detected by both algorithms increases slowly as the value of theta is
increased. This increase constitutes a decrease in accuracy since we want to get as close to the
number of true clusters as possible. As desired, the number of impure clusters remains very small
throughout. The fact that nearly 100% of the detected clusters are pure suggests that we could
relax various parameters of our algorithm in order to combine more clusters, without erroneously
creating too many impure clusters.
4.4. Varying the size of the database
This section studies how the size of the database affects the accuracy of duplicate detection. We consider databases with respectively 10000, 20000, 40000, 80000 and 120000 original records. In each case, duplicates are generated using a Zipf distribution with a high noise level, Zipf parameter θ = 0.40, and a maximum of 20 duplicates per
original record. The largest database considered here contains over 900000 records in total.
Figures 3 and 4 show the performance of the PQS-based algorithms and the merge-purge
algorithm on these databases. Again, the same algorithm parameters as before are used. The
figures clearly display the benefits of the PQS-based algorithm. Figure 3 shows that the number of clusters detected by all strategies is similar, with the PQS-based strategy having slightly better accuracy. While the algorithms detect nearly the same number of clusters, they do not achieve this accuracy with similar numbers of record comparisons. As shown in Figure 4, the PQS-based
algorithms are doing many fewer pairwise record comparisons. For the largest database tested,
the PQS-based strategy performs about 3.0 million comparisons, while the merge-purge algorithm
performs about 18.8 million comparisons. This is six times as many comparisons as the PQS-based
algorithm that uses the same pairwise matching method and achieves essentially the same accuracy.
Overall, these experiments show the significant improvement that the union-find data structure and the priority queue of cluster subsets strategy have over the merge-purge algorithm. The best depiction of this is the comparison of the merge-purge algorithm with the PQS w/HS algorithm. In both of these cases, the exact same record matching function is used. The difference is in the number of times this function is applied. The merge-purge algorithm applies the record matching function to records that fall within a fixed size window, thus making unnecessary record comparisons. The PQS w/HS and the PQS w/SW algorithms apply the record matching function more effectively through the use of the union-find data structure and the priority queue of sets.

[Figure 4: a log-log plot of the number of record comparisons against the total number of records in the database, with curves for merge/purge, U/F + |W|=10 w/SW, U/F + PQS w/SW, and U/F + PQS w/HS.]
Fig. 4: Number of comparisons performed by the algorithms (log-log plot).
This savings in number of comparisons performed is crucial when dealing with very large
databases. Our algorithm responds adaptively to the size and the homogeneity of the clusters
discovered as the database is scanned. These results do not depend on the record matching
algorithm which is used. Instead, the savings is due to the maintenance of the clusters in the
union-find data structure and the use of the priority queue to determine which records to compare. In addition to these benefits, the experiments also show that there is no loss in accuracy when using the Smith-Waterman algorithm over one which uses domain-specific knowledge.
5. DETECTING APPROXIMATE DUPLICATE RECORDS IN A REAL BIBLIOGRAPHIC
DATABASE
This section looks at the effectiveness of the algorithm on a real database of bibliographic records describing documents in various fields of computer science published in several sources.
The database is a slightly larger version of one used by [21]. He presents an algorithm for detecting
bibliographic records which refer to the same work. The task here is not to purge the database
of duplicate records, but instead to create clusters that contain all records about the same entity.
An entity in this context is one document, called a "work", that may exist in several versions. A work may be a technical report which later appears in the form of a conference paper, and still later as a journal paper. While the bibliographic records contain many fields, the algorithm of [21]
considers only the author and the title of each document, as do the methods tested here.
The bibliographic records were gathered from two major collections available over the Internet.
The primary source is A Collection of Computer Science Bibliographies assembled by Alf-Christian Achilles [2]. Over 200000 records were taken from this collection, which currently contains over 600000 BibTeX records. The secondary source is a collection of computer science technical reports produced by five major universities in the CS-TR project [49, 23]. This contains approximately 6000 records.
In total, the database used contains 254618 records. Since the records come from a collection of
bibliographies, the database contains multiple BibTeX records for the same document. In addition, due to the different sources, the records are subject to typographical errors, errors in the
accuracy of the information they provide, variation in how they abbreviate author names, and
more.
Cluster    Number of    Number of    % of all
size       clusters     records      records
---------------------------------------------
1            118149       118149      46.40%
2             28323        56646      22.25%
3              8403        25209       9.90%
4              3936        15744       6.18%
5              1875         9375       3.68%
6              1033         6198       2.43%
7               626         4382       1.72%
8+             1522        18915       7.40%
total        163867       254618     100.00%
[21]         162535       242705     100.00%

Table 2: Results of duplicate detection on a database of bibliographic records
To apply the duplicate detection algorithm to the database, first we created simple representative records from the complete BibTeX records. Each derived record contains the author names
and the document title from one BibTeX record. As in all other experiments, the PQS-based algorithm uses two passes over the database and uses the same domain-independent sorting criteria.
Small experiments allowed us to determine that the best Smith-Waterman algorithm threshold for
a match in this database was 0.65. The threshold is higher for this database because it has less
noise than the synthetic databases used in other experiments.
Results for the PQS-based algorithm using a priority queue size of 4 are presented in Table 2.
The algorithm detected a total of 163867 clusters, with an average of 1.60 records per cluster.
The true number of duplicate records in this database is not known. However, based on visual
inspection, the great majority of detected clusters are pure.
The number of clusters detected by the PQS-based algorithm is comparable to the results of
[21] on almost the same database. However, Hylton reports making 7.5 million comparisons to determine the clusters whereas the PQS-based algorithm performs just over 1.6 million comparisons.
This savings of over 75% is comparable to the savings observed on synthetic databases.
6. CONCLUSION
The integration of information sources is an important area of research. There is much to
be gained from integrating multiple information sources. However, there are many obstacles that
must be overcome to obtain valuable results from this integration. This article has explored and
provided solutions to some of the problems to be overcome in this area. In particular, to integrate
data from multiple sources, one must first identify the information which is common in these sources. Different record matching algorithms were presented that determine the equivalence of records from these sources. Section 2.3 presents the Smith-Waterman algorithm, which should be useful for typical alphanumeric records that contain fields such as names, addresses, titles, dates, identification numbers, and so on.
The Smith-Waterman algorithm was successfully applied to the problem of detecting duplicate
records in databases of mailing addresses and of bibliographic records without any changes to
the algorithm. Although the Smith-Waterman component does have several tunable parameters,
in typical alphanumeric domains we are confident that the numerical parameters suggested in Section 2.3 can be used without change. The one parameter that should be changed for different
applications is the threshold for declaring a match. This threshold is easy to set by examining
a small number of pairs of records whose true matching status is known. In Section 5, we used
the Smith-Waterman algorithm in detecting duplicate bibliographic records and compared it to
the algorithm developed by Hylton [21]. We cannot perform experiments that use the equational
theory of [18] because that equational theory only applies to mailing list records. An entirely
new equational theory for bibliographic records would have to be written in order to make these
comparisons. Since the Smith-Waterman algorithm is domain independent, we successfully used
the same algorithm as in the experiments of the previous section without any modifications. Only
the thresholds needed to be adjusted for this database. Future work should investigate automated
methods for learning optimal values for the thresholds and the other Smith-Waterman algorithm
parameters.
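One possible automation of this threshold-setting step is sketched below: score a small labeled sample of record pairs and keep the cutoff that classifies the sample most accurately. The candidate grid of thresholds and the shape of the input are assumptions made for illustration; nothing here is the paper's procedure beyond the idea of using a few pairs whose true matching status is known.

    # Hypothetical threshold calibration from a small labeled sample.
    # labeled_pairs: iterable of (record_a, record_b, is_duplicate) triples.
    # similarity: any pairwise matcher returning a score in [0, 1].
    def choose_threshold(labeled_pairs, similarity):
        scored = [(similarity(a, b), dup) for a, b, dup in labeled_pairs]
        best_threshold, best_correct = 0.5, -1
        for t in (x / 100.0 for x in range(40, 91, 5)):   # try 0.40, 0.45, ..., 0.90
            correct = sum((score >= t) == dup for score, dup in scored)
            if correct > best_correct:
                best_threshold, best_correct = t, correct
        return best_threshold

With a small number of labeled pairs, such a procedure could stand in for the manual examination described above; learning all of the Smith-Waterman parameters jointly is the more ambitious goal left to future work.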
The duplicate detection methods described in this work improve on previous related work in three
ways. The first contribution is an approximate record matching algorithm that is relatively domain-independent. However, this algorithm, an adaptation of the Smith-Waterman algorithm, does
have parameters that can in principle be optimized (perhaps automatically) to provide better
accuracy in specific applications. The second contribution is to show how to compute the transitive
closure of "is a duplicate of" relationships incrementally, using the union-find data structure.
The third contribution is a heuristic method for minimizing the number of expensive pairwise
record comparisons that must be performed while comparing individual records with potential
duplicates. It is important to note that the second and third contributions can be combined
with any pairwise record matching algorithm. In particular, we performed experiments on two
algorithms that contained these contributions, but that used dierent record matching algorithms.
The experiments resulted in high duplicate detection accuracy while performing significantly
fewer record comparisons than previous related work.
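To make the second contribution concrete, the sketch below is a generic union-find structure with path compression and union by size, in the spirit of [10, 20]; it is not the system's exact code. Calling union on every pair that the pairwise matcher declares a match keeps the transitive closure of the "is a duplicate of" relation available incrementally as the database is scanned.

    # Generic union-find (disjoint-set) structure: path compression in find,
    # union by size in union. Records can be any hashable identifiers.
    class UnionFind:
        def __init__(self):
            self.parent = {}
            self.size = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            self.size.setdefault(x, 1)
            root = x
            while self.parent[root] != root:
                root = self.parent[root]
            while self.parent[x] != root:        # path compression
                self.parent[x], x = root, self.parent[x]
            return root

        def union(self, x, y):
            rx, ry = self.find(x), self.find(y)
            if rx == ry:
                return
            if self.size[rx] < self.size[ry]:    # union by size
                rx, ry = ry, rx
            self.parent[ry] = rx
            self.size[rx] += self.size[ry]

    # Whenever the matcher declares records r1 and r2 duplicates: uf.union(r1, r2).
    # Two records lie in the same cluster exactly when uf.find(r1) == uf.find(r2).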
REFERENCES
[1] J. Ace, B. Marvel, and B. Richer. Matchmaker matchmaker find me the address (exact address match processing). Telephone Engineer and Management, 96(8):50, 52–53 (1992).
[2] Alf-Christian Achilles. A collection of computer science bibliographies. URL, http://liinwww.ira.uka.de/bibliography/index.html (1996).
[3] C. Batini, M. Lenzerini, and S. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364 (1986).
[4] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–65 (1983).
[5] Robert S. Boyer and J. Strother Moore. A fast string-searching algorithm. Communications of the ACM, 20(10):762–772 (1977).
[6] Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proceedings of the Sixth International World Wide Web Conference, pp. 391–404, http://www.scope.gmd.de/info/www6/technical/paper205/paper205.html (1997).
[7] L. Brownston, R. Farrell, and E. Kant. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley Publishing Company (1985).
[8] U.S. Census Bureau, editor. U.S. Census Bureau's 1997 Record Linkage Workshop, Arlington, Virginia. Statistical Research Division, U.S. Census Bureau (1997).
[9] W. I. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In CPM: 3rd Symposium on Combinatorial Pattern Matching, pp. 175–84 (1992).
[10] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press (1990).
[11] Brenda G. Cox. Business survey methods. John Wiley & Sons, Inc., Wiley series in probability and mathematical statistics (1995).
[12] M.-W. Du and S. C. Chang. Approach to designing very fast approximate string matching algorithms. IEEE Transactions on Knowledge and Data Engineering, 6(4):620–633 (1994).
[13] Oren Etzioni and Mike Perkowitz. Category translation: learning to understand information on the Internet. In Proceedings of the International Joint Conference on AI, pp. 930–936 (1995).
[14] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210 (1969).
[15] Z. Galil and R. Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4:33–72 (1988).
[16] C. A. Giles, A. A. Brooks, T. Doszkocs, and D. J. Hummel. An experiment in computer-assisted duplicate checking. In Proceedings of the ASIS Annual Meeting, page 108 (1976).
[17] Patrick A. V. Hall and Geoff R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381–402 (1980).
[18] M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 127–138 (1995).
[19] Mauricio Hernandez. A Generalization of Band Joins and the Merge/Purge Problem. Ph.D. thesis, Columbia University (1996).
[20] J. E. Hopcroft and J. D. Ullman. Set merging algorithms. SIAM Journal on Computing, 2(4):294–303 (1973).
[21] Jeremy A. Hylton. Identifying and merging related bibliographic records. M.S. thesis, MIT, published as MIT Laboratory for Computer Science Technical Report 678 (1996).
[22] C. Jacquemin and J. Royaute. Retrieving terms and their variants in a lexicalized unification-based framework. In Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 132–141 (1994).
[23] Robert E. Kahn. An introduction to the CS-TR project [WWW document]. URL, http://www.cnri.reston.va.us/home/cstr.html (1995).
[24] Beth Kilss and Wendy Alvey, editors. Record linkage techniques, 1985: Proceedings of the Workshop on Exact Matching Methodologies, Arlington, Virginia. Internal Revenue Service, Statistics of Income Division, U.S. Internal Revenue Service, Publication 1299 (2-86) (1985).
[25] W. Kim, I. Choi, S. Gala, and M. Scheevel. On resolving schematic heterogeneity in multidatabase systems. Distributed and Parallel Databases, 1(3):251–279 (1993).
[26] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350 (1977).
[27] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439 (1992).
[28] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics – Doklady, 10:707–710 (1966).
[29] M. Madhavaram, D. L. Ali, and Ming Zhou. Integrating heterogeneous distributed database systems. Computers & Industrial Engineering, 31(1–2):315–318 (1996).
[30] Tova Milo and Sagit Zohar. Using schema matching to simplify heterogeneous data translation. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pp. 122–133. Morgan Kaufmann (1998).
[31] Alvaro E. Monge and Charles P. Elkan. WebFind: Automatic retrieval of scientific papers over the world wide web. In Working notes of the Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, page 151. AAAI Press (1995).
[32] Alvaro E. Monge and Charles P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 267–270. AAAI Press (1996).
[33] Alvaro E. Monge and Charles P. Elkan. WebFind: Mining external sources to guide www discovery [demo]. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press (1996).
[34] Alvaro E. Monge and Charles P. Elkan. The WebFind tool for finding scientific papers over the worldwide web. In Proceedings of the 3rd International Congress on Computer Science Research, pp. 41–46, Tijuana, Baja California, Mexico (1996).
[35] Alvaro E. Monge and Charles P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona (1997).
[36] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443–453 (1970).
[37] Howard B. Newcombe. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press (1988).
[38] Howard B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954–959 (1959). Reprinted in [24].
[39] J. Peterson. Computer programs for detecting and correcting spelling errors. Communications of the ACM, 23(12):676–687 (1980).
[40] T. E. Senator, H. G. Goldberg, J. Wooton, M. A. Cottini, et al. The financial crimes enforcement network AI system (FAIS): identifying potential money laundering from reports of large cash transactions. AI Magazine, 16(4):21–39 (1995).
[41] A. Silberschatz, M. Stonebraker, and J. D. Ullman. Database research: achievements and opportunities into the 21st century. A report of an NSF workshop on the future of database research (1995).
[42] B. E. Slaven. The set theory matching system: an application to ethnographic research. Social Science Computer Review, 10(2):215–229 (1992).
[43] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197 (1981).
[44] W. W. Song, P. Johannesson, and J. A. Bubenko Jr. Semantic similarity relations and computation in schema integration. Data & Knowledge Engineering, 19(1):65–97 (1996).
[45] Y. R. Wang, S. E. Madnick, and D. C. Horton. Inter-database instance identification in composite information systems. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, pp. 677–84 (1989).
[46] William E. Winkler. Advanced methods of record linkage. In American Statistical Association, Proceedings of the Section of Survey Research Methods, pp. 467–472 (1994).
[47] William E. Winkler. Matching and Record Linkage, pp. 355–384. In Brenda G. Cox [11], Wiley series in probability and mathematical statistics (1995).
[48] M. I. Yampolskii and A. E. Gorbonosov. Detection of duplicate secondary documents. Nauchno-Tekhnicheskaya Informatsiya, 1(8):3–6 (1973).
[49] T. W. Yan and H. Garcia-Molina. Information finding in a digital library: the Stanford perspective. SIGMOD Record, 24(3):62–70 (1995).