Class slides July 26, 2000

Data Quality
Class 10
Agenda
• Review of Last week
• Cleansing Applications
• Guest Speaker
Review of Similarity
• We use measures of similarity to establish
scores and thresholds for linkage
• Edit Distance
• Phonetic Similarity
• Ngrams
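As a reminder of how an n-gram measure can produce a score, here is a minimal sketch. It assumes character bigrams and Jaccard overlap of the two n-gram sets; the function names are illustrative, and other n-gram scoring formulas are equally valid.

```python
def ngrams(text, n=2):
    """Return the set of character n-grams in text (lowercased)."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets, in the range 0..1."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        # Both strings are shorter than n: fall back to an exact test.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(ga & gb) / len(ga | gb)


# "smith" and "smyth" share the bigrams "sm" and "th" out of six distinct ones.
score = ngram_similarity("smith", "smyth")
```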
Linkage
• Critical component of data quality applications
• Linkage involves finding a link between a pair of
records, either through an exact match or through
an approximate match
• Linkage is useful for
– Deduplification
– Merge/Purge
– Enhancement
– Householding
Linkage 2
• Two records are linked when they match with
enough weighted similarity
• Matching can range from exact matching on
particular fields to approximate matching with
some degree of similarity
• For example: two customer records can be linked
via an account number, if account numbers are
uniquely assigned to customers
Thresholding
• We have functions to determine similarity
• Tune these functions to return a value
between 0 and 1 (i.e., 0% and 100%)
• 0 = absolutely no match
• 1 = absolute match
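One possible way to tune a function into the 0 to 1 range is sketched below for edit distance. The normalization (dividing by the length of the longer string) is an assumption made for illustration, not something the slides prescribe.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to 0..1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```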
Thresholding 2
• For any pair of values, we can apply the
similarity function and get a score
• For different kinds of data, we can set a
minimum threshold, above which the two
values are said to match
• For example, for an n-gram match, we can
set the threshold at 75%
Thresholding 3
• When comparing 2 records, we apply the
similarity functions in a pairwise manner to each
of the important attribute value pairs
• We assign some weight to each attribute value pair
score
• Total similarity score for each pair of records is
the sum of individual attribute value pair scores,
adjusted by weight
• We can assign an overall threshold indicating a
match
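The weighted record score can be sketched as follows. The field names, the weights, and the use of difflib as the approximate similarity function are all assumptions for illustration. Dividing by the total weight is also an added choice, made so that the record score stays between 0 and 1.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """0/1 exact match test."""
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    """Stand-in approximate similarity in 0..1 (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(rec_a, rec_b, comparators):
    """Weighted sum of per-attribute scores, normalized by total weight.

    comparators maps a field name to (similarity_function, weight).
    """
    total_weight = sum(weight for _, weight in comparators.values())
    weighted = sum(
        weight * sim(rec_a[field], rec_b[field])
        for field, (sim, weight) in comparators.items()
    )
    return weighted / total_weight


# Hypothetical fields and weights, chosen only for illustration.
COMPARATORS = {
    "name": (fuzzy, 2.0),
    "zip": (exact, 1.0),
}

a = {"name": "John Smith", "zip": "20850"}
b = {"name": "Jon Smith", "zip": "20850"}
score = record_similarity(a, b, COMPARATORS)
```

In practice the weights would come from insight into which attributes are most reliable for the application.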
Thresholds and Matches
• We actually assign two thresholds, to partition
scores into three groups:
– Definite matches
– User review
– Definite no-matches
• Scores above the high threshold are definite
matches
• Scores between low and high thresholds are user
review matches
• All others are not matches
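The three-way partition can be sketched as a small function. The threshold values 0.60 and 0.85 are hypothetical, and treating a score exactly equal to a threshold as "review" is an assumption.

```python
def classify(score, low=0.60, high=0.85):
    """Partition a similarity score into one of three groups."""
    if score > high:
        return "match"      # definite match
    if score >= low:
        return "review"     # needs user review
    return "no-match"       # definite no-match


labels = [classify(s) for s in (0.95, 0.70, 0.30)]
```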
Deduplification
• Duplicates are sets of records that represent the
same entity
• Duplicate elimination is the process of finding
records that belong to the same equivalence class
based on similarity above a specified threshold
• When duplicates are found, one record is created
to replace all the duplicates
• That record is composed of the “best” data
gleaned from all equivalence class members
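A minimal sketch of these two steps, assuming the matching pairs have already been found: a union-find structure groups records into equivalence classes, and each class is then collapsed into one record. Taking the longest value per field as the "best" data is a placeholder rule chosen for illustration only.

```python
def find(parent, i):
    """Return the representative of i's class, compressing the path."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def equivalence_classes(n, matched_pairs):
    """Group record indexes 0..n-1 into classes linked by matched pairs."""
    parent = list(range(n))
    for a, b in matched_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())


def merge_records(records):
    """Build one record, keeping the longest value seen for each field."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


classes = equivalence_classes(4, [(0, 1), (1, 2)])
survivor = merge_records([
    {"name": "J. Smith", "phone": ""},
    {"name": "John Smith", "phone": "555-0100"},
])
```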
Merge/Purge
• Similar to duplicate elimination
• Application used when merging two or
more different databases
• Goal: find all the records associated with
the same entity
• Example: when two banks merge, find all
accounts owned by the same person in both
banks
Enhancement
• We can enhance data by merging it with
other data sets
• Linkage may be based on profile
information extracted from each record
Householding
• We link on address as well as some
permutation of the entity name
• Look to establish a location match and some
relation between entities at that location
Naive Algorithm
• Goal: find all possible matches between any
pair of records
• Approach: Perform a pairwise similarity
score for all record pairs
• Downside: O(n^2) comparisons
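The naive approach can be sketched as a loop over every pair of records. The similarity function and threshold are passed in as parameters; the names are illustrative. For n records this performs n(n-1)/2 comparisons.

```python
from itertools import combinations


def naive_linkage(records, similarity, threshold):
    """Score every pair of records and keep those at or above the threshold."""
    matches = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        if similarity(a, b) >= threshold:
            matches.append((i, j))
    return matches


def same(a, b):
    """Trivial 0/1 similarity used only for this example."""
    return 1.0 if a == b else 0.0


pairs = naive_linkage(["ann", "bob", "ann"], same, 1.0)
# 1,000 records already require 499,500 comparisons.
comparisons = 1000 * 999 // 2
```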
Improvements
• Desire to reduce the number of candidates
for pairwise similarity testing
• We can use fast matching for fixed pivot
values when merging data sets
• We use a concept called blocking to reduce
the search set
Fast Matching for Merging
• We have looked at Bloom filters for fast
matching
• We can load all (recordID,attribute value)
tuples from the first data set into the Bloom
filter in O(n) time
• We can then test each of the values from the
second set to see if the pivot matches in the
first set
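A rough sketch of this idea follows. It is a simplified Bloom filter keyed on the pivot value alone, with the bit-array size, the number of hashes, and the SHA-256 based hashing all chosen for illustration. A Bloom filter can report false positives but never false negatives, so the hits are only candidates that still need an exact check.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        """Derive num_hashes bit positions from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


# Load the pivot values (hypothetical account numbers) from the first set.
first_set = ["A-100", "A-200", "A-300"]
bloom = BloomFilter()
for account in first_set:
    bloom.add(account)

# Probe with the second set; hits are candidates to confirm exactly.
second_set = ["A-200", "B-999"]
candidates = [account for account in second_set if account in bloom]
```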
Blocking
• Goal: reduce number of match candidates
by using some form of “compression” on
the records to be linked
• Example: phonetic encodings
• Example: limit by fixing one of the
attributes
• Example: find a pivot attribute and use that
for affinity
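As an illustration of a phonetic encoding used as a blocking key, here is a sketch of American Soundex. Soundex is one common choice and is an assumption here; the slides do not say which encoding is intended. Names that sound alike map to the same four-character code, so only records sharing a code need to be compared.

```python
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so repeated codes across them count
    return (encoded + "000")[:4]


keys = [soundex(n) for n in ("Robert", "Rupert", "Ashcraft")]
```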
Blocking 2
• Example: if we want to perform
householding on a mailing list:
– Block by ZIP code, since we don’t expect to
find members of the same household living in
different locations
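Blocking by ZIP code can be sketched as grouping record indexes by the block key and generating candidate pairs only within each group. The record layout and the field name "zip" are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)
    for members in blocks.values():
        yield from combinations(members, 2)


mailing_list = [
    {"name": "John Smith", "zip": "20850"},
    {"name": "Mary Smith", "zip": "20850"},
    {"name": "Ann Jones", "zip": "10001"},
    {"name": "J. Smith", "zip": "20850"},
]
pairs = list(candidate_pairs(mailing_list, lambda rec: rec["zip"]))
```

Here only 3 of the 6 possible pairs are scored; the saving grows quickly as the number of blocks increases.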
Linkage Algorithms
• All linkage algorithms make use of these ideas:
– A blocking mechanism
• We must choose based on the data available and what makes
sense for the application
– Similarity functions
• Every data type and data domain should have an associated
similarity function, even if it is a 0/1 exact match test
– Weights for the similarity functions
• This requires more insight into the problem, to see how each
attribute’s scores should weigh for the overall score