
DUPLICATE RECORD DETECTION
AHMED K. ELMAGARMID
PURDUE UNIVERSITY, WEST LAFAYETTE, IN
Senior Member, IEEE
PANAGIOTIS G. IPEIROTIS
LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY
Member, IEEE Computer Society
VASSILIOS S. VERYKIOS
UNIVERSITY OF THESSALY, VOLOS, GREECE
Member, IEEE Computer Society
PRESENTED BY
SHILPA MURTHY
INTRODUCTION TO THE PROBLEM
• Databases play an important role in today’s IT-based economy.
• Many businesses and organizations depend on the quality of the data (or the lack thereof) stored in their databases.
• Any discrepancies in the data can have significant cost implications for a system that relies on information to function.

DATA QUALITY
• Data are not carefully controlled for quality, nor defined in a consistent way across different data sources, so data quality is compromised by many factors, for example:
• Data entry errors, e.g., "Microsft" instead of "Microsoft"
• Integrity errors, e.g., EmployeeAge = 567
• Multiple conventions for recording information, e.g., "44 W. 4th Street" and "44 West Fourth Street"

DATA HETEROGENEITY
• While integrating data from different sources into a warehouse, organizations become aware of potential systematic differences. These problems and conflicts fall under the umbrella term "data heterogeneity".
• Two types of heterogeneity can be distinguished: structural heterogeneity and lexical heterogeneity.

DATA CLEANING
• Data cleaning refers to the process of resolving identification problems in the data.
• Structural heterogeneity: different record structure, e.g., Addr versus City, State, and Zip code [1]
• Lexical heterogeneity: identical record structure, but the data is represented differently, e.g., 44 W. 4th St. versus 44 West Fourth Street [1]

TERMINOLOGY
• Record linkage
• Identity uncertainty
• Record matching
• Coreference resolution
• Data deduplication
• Duplicate record detection
• Name matching
• Merge/purge
• Database hardening
• Instance identification
DATA PREPARATION
• A step before duplicate record detection.
• Improves the quality of the data.
• Makes data more comparable and more usable.
• The data preparation stage includes three steps.

STEPS IN DATA PREPARATION
• Parsing
• Data transformation
• Standardization

PARSING
• Locates, identifies, and isolates individual data elements.
• Makes it easier to correct, standardize, and match data.
• Allows comparison of individual components rather than complex strings.
• For example, the appropriate parsing of name and address components into consistent packets is a very important step (see the sketch below).
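A minimal parsing sketch in Python; the name conventions handled and the component names are hypothetical illustrations, not taken from the paper:

def parse_name(raw):
    """Split a raw name string into (first, middle, last) components.
    Handles 'Last, First Middle' and 'First Middle Last' conventions."""
    raw = raw.strip()
    if "," in raw:
        last, _, rest = raw.partition(",")
        parts = rest.split()
    else:
        parts = raw.split()
        last, parts = parts[-1], parts[:-1]
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return first, middle, last.strip()

print(parse_name("Smith, John R."))  # ('John', 'R.', 'Smith')
print(parse_name("John R. Smith"))   # ('John', 'R.', 'Smith')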

DATA TRANSFORMATION
• Simple conversions of data type.
• Field renaming.
• Decoding of field values.
• Range checking: examining data in a field to ensure that it falls within the expected range, usually a numeric or date range.
• Dependency checking: a slightly more complex kind of data transformation, where the values in one field are checked against the values in another field to ensure a minimal level of consistency in the data (see the sketch below).
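A minimal sketch of range and dependency checking; the fields (age, years_employed) and the minimum hiring age are hypothetical:

def range_check(value, low, high):
    """Flag values outside the expected numeric range."""
    return low <= value <= high

def dependency_check(record):
    """Cross-field check: age must be consistent with years employed."""
    # Hypothetical rule: nobody is hired before age 16.
    return record["age"] - record["years_employed"] >= 16

record = {"age": 567, "years_employed": 3}
print(range_check(record["age"], 16, 100))  # False: EmployeeAge = 567 is out of range
print(dependency_check(record))             # True, but the range check already failed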

DATA STANDARDIZATION

• Represent certain fields in a standard format:
• Addresses (e.g., using the US Postal Service address verification tools)
• Date and time formatting
• Names (first, last, middle, prefix, suffix)
• Titles
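A minimal date-standardization sketch; the list of input formats is an assumption about what the source data might contain:

from datetime import datetime

# Hypothetical formats observed in the source data.
KNOWN_FORMATS = ["%m/%d/%Y", "%d-%b-%Y", "%Y-%m-%d", "%B %d, %Y"]

def standardize_date(raw):
    """Normalize a date string to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # leave unrecognized values for manual review

for raw in ["07/04/2007", "4-Jul-2007", "July 4, 2007"]:
    print(raw, "->", standardize_date(raw))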

LAST STEP IN DATA PREPARATION
• Store the data in tables having comparable fields.
• Identify the fields suitable for comparison.
• Not foolproof: data may still contain inconsistencies due to misspellings and different conventions for representing data.

FIELD MATCHING TECHNIQUES
• The most common source of mismatches in database entries is typographical errors.
• The field matching metrics designed to overcome this problem fall into four groups:
• Character-based similarity metrics
• Token-based similarity metrics
• Phonetic similarity metrics
• Numeric similarity metrics
CHARACTER BASED SIMILARITY
• Works best on typographical errors.
• Edit distance: the shortest sequence of edit commands that can transform a string s1 into a string s2.
• Three types of edit operations: insert, delete, and replace. If each operation has cost 1, this version of edit distance is referred to as the Levenshtein distance.
• Example: s1 = "tin", s2 = "tan". We need to replace "i" with "a" to convert s1 into s2. The edit distance here is 1, because only one operation was needed (see the sketch below).
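A minimal sketch of the Levenshtein distance as a dynamic program, using the slide's example strings:

def levenshtein(s, t):
    """Edit distance with unit costs for insert, delete, and replace."""
    m, n = len(s), len(t)
    # dist[i][j] = edit distance between s[:i] and t[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # replace/match
    return dist[m][n]

print(levenshtein("tin", "tan"))  # 1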
CHARACTER BASED SIMILARITY
• Affine gap distance: handles strings that have been truncated or shortened, e.g., "John R. Smith" versus "Jonathan Richard Smith".
• Smith-Waterman distance: substring matching that ignores mismatching prefixes and suffixes, e.g., "Prof. John R. Smith" and "John R. Smith, Prof.".
• Jaro distance: mainly used for comparing first and last names.
• Q-grams: divide a string into a series of substrings of length q. E.g., "NELSON" and "NELSEN" are phonetically similar but spelled differently; their 2-grams (NE, EL, LS, SO, ON versus NE, EL, LS, SE, EN) overlap heavily, as the sketch below shows.
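A minimal q-gram similarity sketch with q = 2; the Dice-style overlap measure is one common choice, not the paper's specific formula:

from collections import Counter

def qgrams(s, q=2):
    """Multiset of overlapping substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))

def qgram_similarity(s, t, q=2):
    """Dice-style overlap: 2 * |shared q-grams| / (|A| + |B|)."""
    a, b = qgrams(s, q), qgrams(t, q)
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / (sum(a.values()) + sum(b.values()))

print(qgram_similarity("NELSON", "NELSEN"))  # 0.6: NE, EL, LS are shared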
TOKEN BASED SIMILARITY

• Works best when words (tokens) are transposed.
• Atomic strings: similarity is the number of matching atomic (word-level) strings divided by the average number of atomic strings in the two fields.
• WHIRL: weights words based on their frequency to determine similarity.
• Each word in the database has an associated weight, and fields are compared using the cosine similarity of their weighted word vectors.
• Example: in a database of company names, the words "AT&T" and "IBM" are less frequent than the word "Inc.", so they carry more weight.
• The similarity of "John Smith" and "Mr. John Smith" is close to 1.
• But the similarity of "Comptr Department" and "Deprtment of Computer" is zero, since WHIRL does not handle misspelled words.
• Q-grams with weighting: extends WHIRL to handle spelling errors (a cosine-similarity sketch follows).
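A minimal WHIRL-style sketch: tf.idf weights combined with cosine similarity over a tiny hypothetical corpus. In such a small corpus "Mr." is rare and thus over-weighted; in a realistic database it would be frequent and carry little weight:

import math
from collections import Counter

# Hypothetical corpus of company-name records.
corpus = ["AT&T Inc.", "IBM Inc.", "John Smith Inc.", "Mr. John Smith"]

def tokens(s):
    return s.lower().replace(".", "").split()

# Inverse document frequency: rare words ("at&t") outweigh common ones ("inc").
docs = [tokens(s) for s in corpus]
idf = {w: math.log(len(docs) / sum(w in d for d in docs))
       for d in docs for w in d}

def tfidf_cosine(s, t):
    """Cosine similarity of tf.idf-weighted word vectors, as in WHIRL."""
    a, b = Counter(tokens(s)), Counter(tokens(t))
    weight = lambda tf, w: tf * idf.get(w, math.log(len(docs)))
    dot = sum(weight(a[w], w) * weight(b[w], w) for w in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(weight(c, w) ** 2 for w, c in v.items()))
    return dot / (norm(a) * norm(b)) if dot else 0.0

print(tfidf_cosine("John Smith", "Mr. John Smith"))
# high; with a realistic corpus "mr" would be frequent and near-weightless
print(tfidf_cosine("Comptr Department", "Deprtment of Computer"))  # 0.0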
PHONETIC SIMILARITY

• Comparison is based on how words sound rather than how they are spelled; Soundex is the classic example (a minimal sketch follows).
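A minimal sketch of classic Soundex, reproducing the NELSON/NELSEN example from the q-grams slide:

SOUNDEX_CODES = {c: d for d, letters in
                 enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1)
                 for c in letters}

def soundex(name):
    """Classic Soundex: keep the first letter, encode the rest as digits,
    drop vowels/h/w/y, collapse repeated codes, pad to 4 characters."""
    name = name.lower()
    code = name[0].upper()
    prev = SOUNDEX_CODES.get(name[0])
    for c in name[1:]:
        digit = SOUNDEX_CODES.get(c)
        if digit and digit != prev:
            code += str(digit)
        if c not in "hw":  # h and w do not break the repeat rule
            prev = digit
    return (code + "000")[:4]

print(soundex("NELSON"), soundex("NELSEN"))  # N425 N425: a phonetic match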
NUMERIC SIMILARITY

• Considers only numbers.
• Typical approaches convert numbers to text data or use simple range queries.
• The authors provide no further insight into this area.
SUMMARY OF METRICS
Comparison metrics:
• Character-based: Edit distance (Levenshtein), Affine gap, Smith-Waterman, Jaro distance, Q-grams
• Token-based: Atomic strings, WHIRL
• Phonetic: Soundex, NYSIIS, Oxford Name Compression, Metaphone, Double Metaphone
• Numeric
DUPLICATE RECORD DETECTION
• The methods described so far address similarity checking in single fields.
• Real-life situations involve multiple fields that have to be checked to detect duplicate records.

CATEGORIZING METHODS
• Probabilistic approaches and supervised machine learning techniques
• Approaches that rely on domain knowledge or generic distance metrics
PROBABILISTIC MATCHING MODELS
• Models derived from Bayes' theorem use prior knowledge to make decisions about the current data set.
• A tuple pair is assigned to one of two classes, M or U. The class M (match) represents the same entity; the class U (non-match) represents different entities.
• The assignment is determined by calculating the probability distribution of the comparison vector under each class (see the sketch after this list).
• Rule-based decision trees: if-then-else traversal.
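A minimal Fellegi-Sunter-style sketch of the probabilistic decision; the per-field agreement probabilities and the thresholds are made up for illustration:

import math

# Hypothetical per-field probabilities: m[f] = P(field f agrees | M),
# u[f] = P(field f agrees | U).
m = {"name": 0.95, "address": 0.80, "zip": 0.90}
u = {"name": 0.01, "address": 0.05, "zip": 0.10}

def match_weight(agreements):
    """Sum of per-field log likelihood ratios."""
    w = 0.0
    for field, agrees in agreements.items():
        if agrees:
            w += math.log(m[field] / u[field])
        else:
            w += math.log((1 - m[field]) / (1 - u[field]))
    return w

def classify(agreements, upper=3.0, lower=-3.0):
    """Assign M, U, or send the pair to manual review."""
    w = match_weight(agreements)
    if w > upper:
        return "M (match)"
    if w < lower:
        return "U (non-match)"
    return "possible match: manual review"

print(classify({"name": True, "address": True, "zip": True}))    # M (match)
print(classify({"name": False, "address": False, "zip": False})) # U (non-match)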
SUPERVISED LEARNING
• Relies on the existence of training data, in the form of record pairs labeled as matching or non-matching.
• The SVM approach outperforms all the simpler approaches.
• A post-processing step creates a graph linking all the matching records.
• Records are considered identical by applying transitivity over the connected components of this graph (a minimal sketch follows).
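A minimal sketch of the supervised approach, assuming scikit-learn is available; the per-field similarity vectors and labels are hypothetical training data:

from sklearn.svm import SVC

# Each row is a vector of per-field similarity scores for a record pair,
# labeled 1 (match) or 0 (non-match).
X_train = [
    [0.95, 0.90, 1.0],  # near-identical pair
    [0.10, 0.20, 0.0],  # clearly different pair
    [0.85, 0.70, 1.0],
    [0.30, 0.10, 0.0],
]
y_train = [1, 0, 1, 0]

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Predict the class of an unseen pair's similarity vector.
print(clf.predict([[0.90, 0.80, 1.0]]))  # expected: [1]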

ACTIVE LEARNING
• Reduces the amount of labeled training data needed by asking the user to label only the most ambiguous record pairs.
DISTANCE BASED TECHNIQUES
• Can be used when there is no training data, or no human effort available to create matching models.
• Treat a record as one long field and use a distance metric (see the sketch below).
• Best matches are ranked using a weighting algorithm.
• Alternatively, use a single field, which must be highly discriminating.
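A minimal sketch of the record-as-one-long-field idea; difflib's ratio is a stand-in for whatever string distance metric is chosen:

import difflib

def record_similarity(r1, r2):
    """Treat each record as one long string and compare; 1.0 means identical."""
    s1 = " ".join(str(v) for v in r1.values())
    s2 = " ".join(str(v) for v in r2.values())
    return difflib.SequenceMatcher(None, s1, s2).ratio()

a = {"name": "John R. Smith", "city": "New York"}
b = {"name": "John Smith", "city": "New York"}
print(record_similarity(a, b))  # high similarity despite the missing initial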
RULE BASED TECHNIQUES

• Relies on business rules to derive a key.
• Functional dependencies must be determined.
• Requires a subject matter expert to build the matching rules (a hypothetical rule is sketched below).
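A hypothetical matching rule; the fields (ssn, last) and the prefix comparison are illustrative only:

def same_person(r1, r2):
    """Hypothetical domain rule, in the spirit of an equational theory:
    IF the social security numbers match
    AND the last-name prefixes are similar
    THEN the records refer to the same person."""
    return (r1["ssn"] == r2["ssn"]
            and r1["last"][:4].lower() == r2["last"][:4].lower())

a = {"ssn": "123-45-6789", "last": "Smith"}
b = {"ssn": "123-45-6789", "last": "Smithe"}
print(same_person(a, b))  # True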

RULE BASED TECHNIQUES
• The accompanying figure (not reproduced here) depicts an equational theory that dictates the logic of domain equivalence.
• It specifies inferences about the similarity of the records.

UNSUPERVISED LEARNING
• Classifies data as matched or unmatched without a training set.
• The comparison vector generally indicates which category a pair belongs to; when it does not, the labeling has to be done manually.
• One way to avoid manual labeling is to use clustering algorithms: group similar comparison vectors together, so that each cluster contains vectors with similar characteristics.
• By knowing the real class of only a few vectors, we can infer the class of all the vectors in their clusters (a minimal sketch follows).
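A minimal clustering sketch, assuming scikit-learn is available; the comparison vectors are hypothetical:

from sklearn.cluster import KMeans

# Hypothetical comparison vectors (per-field similarity scores) for pairs.
vectors = [
    [0.95, 0.90], [0.90, 0.85],  # likely matches
    [0.10, 0.05], [0.15, 0.20],  # likely non-matches
]

# Cluster into two groups; labeling one vector per cluster then propagates
# a match/non-match class to the whole cluster.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(km.labels_)  # e.g. [1 1 0 0]: two clusters of similar vectors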

TECHNIQUES TO IMPROVE EFFICIENCY
• Reduce the number of record comparisons
• Improve the efficiency of each record comparison

COMPARATIVE METRICS

• Elementary nested loop: compare every record in one table to every record in the other table.
• Requires |A| × |B| comparisons (a Cartesian product), which is very expensive.
• The cost of a single comparison must also be considered, including the number of fields per record.
REDUCE RECORD COMPARISONS
• Blocking
• Sorted neighborhood
• Clustering and canopies
• Set joins

BLOCKING
• Basic idea: compute a hash value for each record and only compare records in the same bucket.
• Subdivide the files into subsets (blocks), e.g., using Soundex, NYSIIS, or Metaphone keys.
• Drawback: increases in speed may increase the number of false mismatches.
• A compromise is to make multiple runs using different blocking fields (see the sketch below).
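A minimal blocking sketch; the three-letter prefix key is a crude stand-in for a Soundex/NYSIIS/Metaphone key:

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "last": "Nelson"},
    {"id": 2, "last": "Nelsen"},
    {"id": 3, "last": "Smith"},
]

# Hypothetical blocking key: first three letters of the last name.
blocks = defaultdict(list)
for r in records:
    blocks[r["last"][:3].upper()].append(r)

# Only pairs inside the same block are ever compared.
for key, members in blocks.items():
    for a, b in combinations(members, 2):
        print(key, "->", a["id"], "vs", b["id"])  # NEL -> 1 vs 2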

SORTED NEIGHBORHOOD
• Create a composite key, sort the data, then merge by comparing records inside a sliding window.
• Assumption: duplicate records will be close to each other in the sorted order.
• Highly dependent upon the comparison key (see the sketch below).
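A minimal sorted-neighborhood sketch; the composite key and the window size are hypothetical:

records = [
    {"last": "Smith", "first": "John", "zip": "10012"},
    {"last": "Smith", "first": "Jon",  "zip": "10012"},
    {"last": "Nelson", "first": "Ann", "zip": "47907"},
]

# Hypothetical composite key: last-name prefix + zip prefix.
key = lambda r: r["last"][:3].upper() + r["zip"][:3]
records.sort(key=key)

# Slide a fixed-size window over the sorted list; only records that
# share a window are compared.
WINDOW = 2
for i in range(len(records)):
    for j in range(i + 1, min(i + WINDOW, len(records))):
        print("compare:", records[i]["first"], "vs", records[j]["first"])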
CLUSTERING AND CANOPIES
• Clustering: duplicate records are kept in a cluster, and only the representative of each cluster is used for future comparisons.
• This reduces the total number of record comparisons without compromising accuracy.
• Canopies: records are grouped into overlapping clusters called "canopies", and records are then compared only within a canopy, which leads to better qualitative results (a minimal sketch follows).
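A minimal canopy sketch in the spirit of the overlapping-cluster idea; the loose/tight thresholds and the cheap similarity metric are assumptions:

import difflib

def sim(a, b):
    """Cheap similarity used to form canopies; any inexpensive metric works."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def canopies(items, loose=0.5, tight=0.8):
    """Two-threshold canopy formation: everything within `loose` of a center
    joins the canopy; anything within `tight` also stops being a candidate
    center, so canopies may overlap."""
    pool = set(items)
    result = []
    while pool:
        center = pool.pop()
        canopy = {center}
        for x in list(pool):
            s = sim(center, x)
            if s >= loose:
                canopy.add(x)    # may appear in several canopies
            if s >= tight:
                pool.discard(x)  # too close to become a new center
        result.append(canopy)
    return result

names = ["John Smith", "Jon Smith", "Ann Nelson", "Anne Nelsen"]
print(canopies(names))  # two canopies: the Smiths and the Nelsons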

SOFTWARE TOOLS

• Open-architecture tools:
• Freely Extensible Biomedical Record Linkage (FEBRL): Python
• TAILOR: MLC++, DBGen
• WHIRL: C++
• Flamingo Project: C
• BigMatch: C
DATABASE TOOLS

• Commercial RDBMSs:
• SQL Server 2005 implements "fuzzy matches".
• Oracle 11g implements these techniques in its utl_match package: Levenshtein distance, Soundex, and Jaro-Winkler.
CONCLUSIONS
• There is a lack of a standardized, large-scale benchmarking data set.
• Training data is needed to produce matching models.
• Research divergence: the database community emphasizes simple, fast, and efficient techniques, while machine learning and statistics rely on sophisticated techniques and probabilistic models.
• More synergy is needed among the various communities.
• Detection systems need to be adaptive over time.