Data Quality Class 10

Agenda
• Review of last week
• Cleansing applications
• Guest speaker

Review of Similarity
• We use measures of similarity to establish measures and thresholds for linkage
• Edit distance
• Phonetic similarity
• N-grams

Linkage
• Critical component of data quality applications
• Linkage involves finding a link between a pair of records, either through an exact match or through an approximate match
• Linkage is useful for:
  – Deduplication
  – Merge/Purge
  – Enhancement
  – Householding

Linkage 2
• Two records are linked when they match with enough weighted similarity
• Matching can range from exact matching on particular fields to approximate matching with some degree of similarity
• For example: two customer records can be linked via an account number, if account numbers are uniquely assigned to customers

Thresholding
• We have functions to determine similarity
• Tune these functions to return a value between 0 and 1 (i.e., 0% and 100%)
• 0 = absolutely no match
• 1 = absolute match

Thresholding 2
• For any pair of values, we can apply the similarity function and get a score
• For different kinds of data, we can set a minimum threshold, above which the two values are said to match
• For example, for an n-gram match, we can set the threshold at 75%

Thresholding 3
• When comparing two records, we apply the similarity functions in a pairwise manner to each of the important attribute-value pairs
• We assign some weight to each attribute-value pair score
• The total similarity score for a pair of records is the sum of the individual attribute-value pair scores, adjusted by weight
• We can assign an overall threshold indicating a match

Thresholds and Matches
• We actually assign two thresholds, to partition scores into three groups:
  – Definite matches
  – User review
  – Definite no-matches
• Scores above the high threshold are definite matches
• Scores between the low and high thresholds are user-review matches
• All others are not matches
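To make the weighted scoring and the two thresholds concrete, here is a minimal Python sketch. The field names, weights, threshold values, and the bigram-based similarity function are illustrative assumptions, not values from the lecture; any per-attribute similarity function that returns a score in [0, 1] could be substituted.

```python
# Minimal sketch: weighted record similarity with two thresholds.
# Field names, weights, and thresholds below are illustrative, not prescriptive.

def bigram_similarity(a: str, b: str) -> float:
    """Character-bigram overlap (Dice coefficient), returning a score in [0, 1]."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def exact_match(a: str, b: str) -> float:
    """0/1 exact-match test, e.g. for an account number."""
    return 1.0 if a == b else 0.0

# One similarity function and one weight per attribute; weights sum to 1.
FIELDS = {
    "name":    (bigram_similarity, 0.5),
    "city":    (bigram_similarity, 0.2),
    "account": (exact_match,       0.3),
}

LOW, HIGH = 0.6, 0.85   # illustrative low and high thresholds

def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of per-attribute similarity scores, in [0, 1]."""
    return sum(w * sim(r1[f], r2[f]) for f, (sim, w) in FIELDS.items())

def classify(score: float) -> str:
    """Partition scores into the three groups described above."""
    if score >= HIGH:
        return "definite match"
    if score >= LOW:
        return "user review"
    return "no match"

r1 = {"name": "Jon Smith",  "city": "Boston", "account": "12345"}
r2 = {"name": "John Smith", "city": "Boston", "account": "12345"}
print(classify(record_similarity(r1, r2)))   # -> "definite match"
```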
Deduplication
• Duplicates are sets of records that represent the same entity
• Duplicate elimination is the process of finding records that belong to the same equivalence class, based on similarity above a specified threshold
• When duplicates are found, one record is created to replace all the duplicates
• That record is composed of the “best” data gleaned from all equivalence class members

Merge/Purge
• Similar to duplicate elimination
• Applied when merging two or more different databases
• Goal: find all the records associated with the same entity
• Example: when two banks merge, find all accounts owned by the same person in both banks

Enhancement
• We can enhance data by merging it with other data sets
• Linkage may be based on profile information extracted from each record

Householding
• We link on address as well as some permutation of the entity name
• Look to establish a location match and some relation between entities at that location

Naive Algorithm
• Goal: find all possible matches between any pair of records
• Approach: compute a pairwise similarity score for all record pairs
• Downside: O(n²)

Improvements
• We want to reduce the number of candidates for pairwise similarity testing
• We can use fast matching for fixed pivot values when merging data sets
• We use a concept called blocking to reduce the search set

Fast Matching for Merging
• We have looked at Bloom filters for fast matching
• We can load all (record ID, attribute value) tuples from the first data set into the Bloom filter in O(n)
• We can then test each of the values from the second set to see if the pivot matches in the first set

Blocking
• Goal: reduce the number of match candidates by using some form of “compression” on the records to be linked
• Example: phonetic encodings
• Example: limit by fixing one of the attributes
• Example: find a pivot attribute and use that for affinity

Blocking 2
• Example: if we want to perform householding on a mailing list:
  – Block by ZIP code, since we don’t expect to find members of the same household living in different locations (see the sketch at the end of this section)

Linkage Algorithms
• All linkage algorithms make use of these ideas:
  – A blocking mechanism
    • We must choose based on the data available and what makes sense for the application
  – Similarity functions
    • Every data type and data domain should have an associated similarity function, even if it is a 0/1 exact-match test
  – Weights for the similarity functions
    • This requires more insight into the problem, to see how much each attribute’s score should weigh in the overall score
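To make blocking concrete, here is a minimal Python sketch of ZIP-code blocking for a householding-style comparison. The record layout, the `compare()` placeholder, and the 0.75 threshold are illustrative assumptions; in practice `compare()` would be the weighted similarity scorer sketched earlier, and the blocking key would be chosen per application.

```python
# Minimal sketch: blocking by ZIP code before pairwise comparison.
# Records, compare(), and the threshold are illustrative only.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "A. Jones",  "zip": "02139"},
    {"id": 2, "name": "Amy Jones", "zip": "02139"},
    {"id": 3, "name": "B. Smith",  "zip": "10001"},
    {"id": 4, "name": "Bob Smith", "zip": "10001"},
]

def compare(r1, r2):
    """Placeholder pairwise similarity; plug in the weighted scorer here."""
    return 1.0 if r1["name"].split()[-1] == r2["name"].split()[-1] else 0.0

# 1. Blocking: group records by the blocking key (ZIP code).
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)

# 2. Pairwise comparison only within each block, instead of across all records.
candidate_pairs = []
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        if compare(r1, r2) >= 0.75:          # illustrative match threshold
            candidate_pairs.append((r1["id"], r2["id"]))

print(candidate_pairs)   # pairs that share a block and score above the threshold
```

With four records, the naive algorithm would score all six pairs; blocking by ZIP code reduces this to one comparison per block, which is where the savings over O(n²) come from.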