Real-World Data Is Dirty: Data Cleansing and the Merge/Purge Problem
Hernandez & Stolfo: Columbia University - 1998
Class Presentation by Haiguang Li, 01. Dec 2011

TOPICS
- Introduction
- A Basic Data Cleansing Solution
- Test & Real World Results
- Incremental Merge Purge w/ New Data
- Conclusion
- Recap

Introduction

The problem:
- Some corporations acquire large amounts of information every month
- The data is stored in many large databases (DB)
- These databases may be heterogeneous
  - Variations in schema
- The data may be represented differently across the various datasets
- Data in these DB may simply be inaccurate

Requirement of the analysis
- The data mining needs to be done
  - Quickly
  - Efficiently
  - Accurately

Examples of real-world applications
- Credit card companies
  - Assess risk of potential new customers
  - Find false identities
  - Match disparate records concerning a customer
- Mass marketing companies
- Government agencies

A Basic Data Cleansing Solution

Duplicate Elimination
- Sorted-Neighborhood Method (SNM)
- This is done in three phases
  1. Create a key for each record
  2. Sort records on this key
  3. Merge/purge records

SNM: Create key
- Compute a key for each record by extracting relevant fields or portions of fields
- Example:

  First  Last    Address           ID        Key
  Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456

SNM: Sort Data
- Sort the records in the data list using the key from step 1
- This can be very time consuming
  - O(N log N) for a good algorithm, O(N^2) for a bad algorithm

SNM: Merge records
- Move a fixed-size window through the sequential list of records
- This limits the comparisons to the records in the window
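The three SNM phases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the key recipe in make_key is an assumption reverse-engineered from the single example key STLSAL123FRST456, the sample records are illustrative, and same_id is a hypothetical placeholder for the real record comparison.

```python
VOWELS = set("AEIOU")

def consonants(word, n):
    """First n consonants of a word, upper-cased."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)[:n]

def make_key(rec):
    """Phase 1: build a sort key (assumed recipe, inferred from the slide example)."""
    number, street = rec["address"].split()[:2]
    return (consonants(rec["last"], 3) + rec["first"].upper()[:3]
            + number + consonants(street, 4) + rec["id"][:3])

def sorted_neighborhood(records, key_fn, window, is_match):
    """Phases 2 and 3: sort on the key, then compare each record only with
    the (window - 1) records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key_fn(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[j], records[i]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

def same_id(r1, r2):
    """Hypothetical stand-in for the real comparison: match on identical ID."""
    return r1["id"] == r2["id"]

# Illustrative sample records (not real data).
records = [
    {"first": "Sal", "last": "Stolfo",  "address": "123 First Street",  "id": "45678987"},
    {"first": "Sal", "last": "Stolfo",  "address": "123 First Street",  "id": "45678987"},
    {"first": "Sal", "last": "Stolpho", "address": "123 First Street",  "id": "45678987"},
    {"first": "Sal", "last": "Stiles",  "address": "123 Forest Street", "id": "45654321"},
]

print(make_key(records[0]))                                # STLSAL123FRST456
print(sorted_neighborhood(records, make_key, 2, same_id))  # [(0, 1), (1, 2)]
```

With a window of w, each record is compared with at most w - 1 neighbors, so the merge phase costs roughly w * N comparisons instead of the N^2 needed to compare every pair.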
SNM: Considerations
- What is the optimal window size while
  - Maximizing accuracy
  - Minimizing computational cost
- Execution time for a large DB will be bound by
  - Disk I/O
  - Number of passes over the data set

Selection of Keys
- The effectiveness of the SNM depends heavily on the key selected to sort the records
- A key is defined to be a sequence of a subset of attributes
- Keys must provide sufficient discriminating power

Example of Records and Keys

  First  Last     Address            ID        Key
  Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
  Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
  Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
  Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456

Equational Theory
- The comparison during the merge phase is an inferential process
- It compares much more information than simply the key
- The more information there is, the better the inferences that can be made

Equational Theory - Example
- Two names are spelled nearly identically and have the same address
  - It may be inferred that they are the same person
- Two social security numbers are the same but the names and addresses are totally different
  - Could be the same person who moved
  - Could be two different people and there is an error in the social security number

A simplified rule in English
  Given two records, r1 and r2
  IF the last name of r1 equals the last name of r2,
  AND the first names differ slightly,
  AND the address of r1 equals the address of r2
  THEN r1 is equivalent to r2

The distance function
- A “distance function” is used to compare pieces of data (usually text)
- Apply the “distance function” to data that “differ slightly”
- Select a threshold to capture obvious typographical errors
  - The threshold affects the number of successful matches and the number of false positives
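The simplified rule and the distance function can be combined as in the sketch below. The slides do not say which distance function is used; Levenshtein edit distance is assumed here as one common choice, and the threshold of 2, the function names, and the sample records are all illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def differ_slightly(a, b, threshold=2):
    """Assumed reading of "differ slightly": edit distance within a threshold.
    Identical strings (distance 0) also pass."""
    return edit_distance(a.lower(), b.lower()) <= threshold

def is_equivalent(r1, r2):
    """The simplified rule: same last name, similar first name, same address."""
    return (r1["last"].lower() == r2["last"].lower()
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"].lower() == r2["address"].lower())

# Illustrative records.
a = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
b = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
c = {"first": "Kathy", "last": "Smith", "address": "48 North St."}

print(edit_distance("Kathi", "Kathy"))  # 1
print(is_equivalent(a, b))              # True
print(is_equivalent(b, c))              # False
```

Raising the threshold catches more typographical variants but also admits more false positives, which is the trade-off the last bullet describes.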
Examples of matched records

  SSN        Name (First, Initial, Last)  Address
  334600443  Lisa Boardman                144 Wars St.
  334600443  Lisa Brown                   144 Ward St.

  525520001  Ramon Bonilla                38 Ward St.
  525520001  Raymond Bonilla              38 Ward St.

  0          Diana D. Ambrosion           40 Brik Church Av.
  0          Diana A. Dambrosion          40 Brick Church Av.

  789912345  Kathi Kason                  48 North St.
  879912345  Kathy Kason                  48 North St.

  879912345  Kathy Kason                  48 North St.
  879912345  Kathy Smith                  48 North St.

Building an equational theory
- The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system
- In complex problems, an expert’s assistance is needed to write the equational theory

Transitive Closure
- In general, no single pass (i.e. no single key) will be sufficient to catch all matching records
- An attribute that appears first in the key has higher discriminating power than those that appear after it
  - If an employee has two records in a DB with SSN 193456782 and 913456782, and the SSN is the first field of the key, the two records are unlikely to fall in the same window

Transitive Closure
- To increase the number of similar records merged
  - Widen the scanning window size, w
  - Execute several independent runs of the SNM
    - Use a different key each time
    - Use a relatively small window
    - Call this the Multi-Pass approach

Transitive Closure
- Each independent run of the Multi-Pass approach will produce a set of pairs of records
- Although one field in a record may be in error, another field may not
- Transitive closure can be applied to those pairs to be merged
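Applying transitive closure to the pairs produced by the independent passes can be sketched with a union-find structure. This is one standard way to compute the closure, assumed here for illustration; the slides do not specify the data structure. The record indices and per-pass pair lists are hypothetical, loosely modeled on the Kathi Kason / Kathy Kason / Kathy Smith rows above.

```python
def transitive_closure(n, pair_lists):
    """Merge n records into clusters given matched pairs from several passes.

    pair_lists holds one list of (i, j) index pairs per SNM pass.
    Returns the clusters as sorted lists of record indices.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster root, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pairs in pair_lists:
        for a, b in pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical indices: 0 = Kathi Kason, 1 = Kathy Kason, 2 = Kathy Smith,
# 3 = an unrelated record.
pass_by_name = [(0, 1)]      # one pass pairs records 0 and 1
pass_by_address = [(1, 2)]   # another pass pairs records 1 and 2

print(transitive_closure(4, [pass_by_name, pass_by_address]))  # [[0, 1, 2], [3]]
```

Records 0 and 2 are never compared directly, yet they end up in the same cluster because each was matched to record 1 in a different pass.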
Multi-pass Matches
- Pass 1 (Lastname discriminates)
    KSNKAT48NRTH789 (Kathi Kason 789912345)
    KSNKAT48NRTH879 (Kathy Kason 879912345)
- Pass 2 (Firstname discriminates)
    KATKSN48NRTH789 (Kathi Kason 789912345)
    KATKSN48NRTH879 (Kathy Kason 879912345)
- Pass 3 (Address discriminates)
    48NRTH879KSNKAT (Kathy Kason 879912345)
    48NRTH879SMTKAT (Kathy Smith 879912345)

Transitive Equality Example
  IF A is equivalent to B
  AND B is equivalent to C
  THEN A is equivalent to C
- From the example:
    789912345  Kathi Kason  48 North St.  (A)
    879912345  Kathy Kason  48 North St.  (B)
    879912345  Kathy Smith  48 North St.  (C)

Test Results

Test Environment
- Test data was created by a database generator
  - Names are randomly chosen from a list of 63000 real names
- The database generator provides a large number of parameters:
  - size of the DB, percentage of duplicates, amount of error…

Correct Duplicate Detection
  [figure]

Time for each run
  [figure]

Accuracy for each run
  [figure]

Real-World Test
- Data was obtained from the Office of Children Administrative Research (OCAR) of the Department of Social and Health Services (State of Washington)
- OCAR’s goals
  - How long do children stay in foster care?
  - How many different homes do children typically stay in?

OCAR’s Database
- Most of OCAR’s data is stored in one relation
- The DB contains 6,000,000 total records
- The DB grows by about 50,000 records per month

Typical Problems in the DB
- Names are frequently misspelled
- SSNs or birthdays are either missing or clearly wrong
- The case number often changes when the child’s family moves to another part of the state
- Some records use service provider names instead of the child’s
- No reliable unique identifier
OCAR Equational Theory
- Keys for the independent runs
  - Last Name, First Name, SSN, Case Number
  - First Name, Last Name, SSN, Case Number
  - Case Number, First Name, Last Name, SSN

OCAR Results
  [figure]

Incremental Merge/Purge w/ New Data

Incremental Merge/Purge
- Lists are concatenated for first-time processing
- Concatenating new data before reapplying the merge/purge process may be very expensive in both time and space
- An incremental merge/purge approach is needed: the Prime Representatives method

Prime-Representative: Definition
- A set of records extracted from each cluster of records, used to represent the information in the cluster
- The “Cluster Centroid” or base element of the equivalence class

Prime-Representative creation
- Initially, no PR exists
- The first merge/purge run creates clusters of similar records
- Correct selection of the PR from each cluster affects the accuracy of the results
- For some clusters, selecting no PR at all can be the best choice

3 Strategies for Choosing PR
- Random Sample
  - Select a sample of records at random from each cluster
- N-Latest
  - Most recent elements entered in the DB
- Syntactic
  - Choose the largest or most complete record

Important Assumption
- No data previously used to select each cluster’s PR will be deleted
  - Deleted records could require restructuring of clusters (expensive)
- No changes in the rule-set will occur after the first increment of data is processed
  - Substantial rule change could invalidate clusters

Results
- Cumulative running time for the Incremental Merge/Purge algorithm is higher than for the classic algorithm
  - PR selection methodology could improve cumulative running time
- Running time for each individual increment is always smaller with the Incremental Merge/Purge algorithm

Conclusion
Cleansing of Data
- The Sorted-Neighborhood Method is expensive due to
  - the sorting phase
  - the need for large windows for high accuracy
- Multiple passes with small windows, followed by transitive closure, improve accuracy and improve performance for a given level of accuracy
  - Increasing the number of successful matches
  - Decreasing the number of false positives

Question 1
- 2 major reasons merging large databases becomes a difficult problem:
  - The databases are heterogeneous
  - The identifiers or strings differ in how they are represented within each DB

Question 2
- The 3 steps in SNM are:
  - Creation of key(s)
  - Sorting records on this key
  - Merge/purge records

Question 3
- 3 strategies for selecting a PR:
  - Random Sample
  - N-Latest
  - Syntactic

The End
Thanks very much!