Real-World Data Is Dirty

Data Cleansing and the
Merge/Purge Problem
Hernandez & Stolfo: Columbia University - 1998
Class Presentation by Haiguang Li, 01. Dec 2011
TOPICS
Introduction
A Basic Data Cleansing Solution
Test & Real World Results
Incremental Merge/Purge w/ New Data
Conclusion
Recap
Introduction
The problem:
Some corporations acquire large amounts of
information every month
The data is stored in many large databases
(DB)
These databases may be heterogeneous

Variations in schema
The data may be represented differently
across the various datasets
Data in these DB may simply be inaccurate
Requirement of the analysis
The data mining needs to be done



Quickly
Efficiently
Accurately
Examples of real-world applications
Credit card companies


Assess risk of potential new customers
Find false identities
Match disparate records concerning a
customer


Mass Marketing companies
Government agencies
A Basic Data Cleansing Solution
Duplicate Elimination
Sorted-Neighborhood Method (SNM)
This is done in three phases



Create a Key for each record
Sort records on this key
Merge/Purge records
SNM: Create key
Compute a key for each record by
extracting relevant fields or portions of
fields
Example:
First  Last    Address           ID        Key
Sal    Stolfo  123 First Street  45678987  STLSAL123FRST456
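The example key appears to combine the first three consonants of the last name, the first three letters of the first name, the street number, the consonants of the street name, and the first three digits of the ID. A minimal Python sketch under that assumption follows; the layout is inferred from the example, and the names `consonants` and `make_key` are illustrative rather than taken from the paper.

```python
VOWELS = set("AEIOU")

def consonants(word):
    """Return the upper-cased consonants of a word, in order."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)

def make_key(first, last, address, ident):
    """Build a sort key from portions of the record's fields (assumed layout)."""
    parts = address.split()
    number, street = parts[0], parts[1]
    return (consonants(last)[:3] + first.upper()[:3]
            + number + consonants(street) + ident[:3])

print(make_key("Sal", "Stolfo", "123 First Street", "45678987"))
# STLSAL123FRST456
```

Because only fragments of each field are used, records with small spelling differences can still receive the same key and sort next to each other.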
SNM: Sort Data
Sort the records in the data list using
the key in step 1
This can be very time consuming


O(N log N) for a good algorithm,
O(N^2) for a bad algorithm
SNM: Merge records
Move a fixed size
window through the
sequential list of
records.
This limits the
comparisons to the
records in the
window
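The window scan can be sketched in Python as follows. This is an illustration, not the authors' implementation: `window_pairs`, `merge_scan`, and the `is_match` predicate are hypothetical names, and the matching rule in the usage lines is a toy prefix test.

```python
def window_pairs(sorted_records, w):
    """Yield each record paired with the (up to) w - 1 records just before it."""
    for i, current in enumerate(sorted_records):
        for earlier in sorted_records[max(0, i - (w - 1)):i]:
            yield earlier, current

def merge_scan(records, key, is_match, w):
    """Sort on the key, then compare only records that share a window."""
    ordered = sorted(records, key=key)
    return [(a, b) for a, b in window_pairs(ordered, w) if is_match(a, b)]

# Toy usage: the record is its own key; two names match if they share a prefix.
names = ["stolfo", "adams", "stolpho", "stiles"]
print(merge_scan(names, key=lambda r: r, is_match=lambda a, b: a[:3] == b[:3], w=2))
# [('stolfo', 'stolpho')]
```

With w = 2 only adjacent records are compared; a larger window catches matches that sort slightly further apart, at the cost of more comparisons.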
SNM: Considerations
What is the optimal window size while


Maximizing accuracy
Minimizing computational cost
Execution time for large DB will be
bound by


Disk I/O
Number of passes over the data set
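To make the cost trade-off concrete, the short Python sketch below counts record comparisons for a window scan versus comparing every pair. The function names are illustrative, and the count assumes each record is compared with the w - 1 records preceding it in sorted order.

```python
def windowed_comparisons(n, w):
    """Comparisons when each record meets at most its w - 1 predecessors."""
    return sum(min(i, w - 1) for i in range(n))

def all_pairs_comparisons(n):
    """Comparisons when every record is compared with every other record."""
    return n * (n - 1) // 2

print(windowed_comparisons(1_000_000, 10))   # 8999955
print(all_pairs_comparisons(1_000_000))      # 499999500000
```

The windowed count grows roughly as w times N, which is why the window size is the main lever between accuracy and computational cost.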
Selection of Keys
The effectiveness of the SNM highly
depends on the key selected to sort the
records
A key is defined to be a sequence of a
subset of attributes
Keys must provide sufficient
discriminating power
Example of Records and Keys
First  Last     Address            ID        Key
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolfo   123 First Street   45678987  STLSAL123FRST456
Sal    Stolpho  123 First Street   45678987  STLSAL123FRST456
Sal    Stiles   123 Forest Street  45654321  STLSAL123FRST456
Equational Theory
The comparison during the merge
phase is an inferential process
Compares much more information than
simply the key
The more information there is, the
better inferences can be made
Equational Theory - Example
Two names are spelled nearly identically and
have the same address

It may be inferred that they are the same person
Two social security numbers are the same but
the names and addresses are totally different


Could be the same person who moved
Could be two different people and there is an error
in the social security number
A simplified rule in English
Given two records, r1 and r2
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2
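The rule above might be written in Python roughly as follows. This is a sketch under assumptions: records are plain dictionaries, and "differ slightly" is approximated with the standard library's `difflib.SequenceMatcher` similarity ratio and an arbitrary threshold of 0.75, neither of which comes from the paper.

```python
from difflib import SequenceMatcher

def differ_slightly(a, b, threshold=0.75):
    """Assumed stand-in: strings are close when their similarity ratio is high.
    Identical strings also pass, since their ratio is 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(r1, r2):
    """The simplified rule: equal last names, similar first names, equal addresses."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"] == r2["address"])

r1 = {"first": "Kathi", "last": "Kason", "address": "48 North St."}
r2 = {"first": "Kathy", "last": "Kason", "address": "48 North St."}
print(same_person(r1, r2))  # True
```

A real equational theory would contain many such rules, each combining exact and approximate field comparisons.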
The distance function
A “distance function” is used to
compare pieces of data (usually text)
Apply “distance function” to data that
“differ slightly”
Select a threshold to capture obvious
typographical errors.

Impacts number of successful matches and
number of false positives
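One common choice of distance function for text is edit (Levenshtein) distance; the slides do not say which function the authors used, so the Python sketch below is only an example of the idea. The threshold of 2 is an arbitrary illustrative value.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

def differ_slightly(a, b, threshold=2):
    """Treat two strings as near-identical when few edits separate them."""
    return edit_distance(a, b) <= threshold

print(edit_distance("Stolfo", "Stolpho"))  # 2
print(edit_distance("Kathi", "Kathy"))     # 1
```

Raising the threshold accepts more typographical variants but also lets more unrelated strings through, which is the trade-off between successful matches and false positives noted above.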
Examples of matched records
SSN        Name (First, Initial, Last)  Address
334600443  Lisa Boardman                144 Wars St.
334600443  Lisa Brown                   144 Ward St.

525520001  Ramon Bonilla                38 Ward St.
525520001  Raymond Bonilla              38 Ward St.

0          Diana D. Ambrosion           40 Brik Church Av.
0          Diana A. Dambrosion          40 Brick Church Av.

789912345  Kathi Kason                  48 North St.
879912345  Kathy Kason                  48 North St.

879912345  Kathy Kason                  48 North St.
879912345  Kathy Smith                  48 North St.
Building an equational theory
The process of creating a good
equational theory is similar to the
process of creating a good knowledge base for an expert system
In complex problems, an expert’s
assistance is needed to write the
equational theory
Transitive Closure
In general, no single pass (i.e. no single key)
will be sufficient to catch all matching records
An attribute that appears first in the key has
higher discriminating power than those
appearing after it

If an employee has two records in a DB with SSN
193456782 and 913456782, it’s unlikely they will
fall under the same window
Transitive Closure
To increase the number of similar
records merged


Widen the scanning window size, w
Execute several independent runs of the
SNM
  Use a different key each time
  Use a relatively small window
  Call this the Multi-Pass approach
Transitive Closure
Each independent run of the Multi-Pass
approach will produce a set of pairs of
records
Although one field in a record may be in
error, another field may not
Transitive closure can be applied to
those pairs to be merged
Multi-pass Matches
Pass 1 (Lastname discriminates)
KSNKAT48NRTH789 (Kathi Kason 789912345 )
KSNKAT48NRTH879 (Kathy Kason 879912345 )
Pass 2 (Firstname discriminates)
KATKSN48NRTH789 (Kathi Kason 789912345 )
KATKSN48NRTH879 (Kathy Kason 879912345 )
Pass 3 (Address discriminates)
48NRTH879KSNKAT (Kathy Kason 879912345 )
48NRTH879SMTKAT (Kathy Smith 879912345 )
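The three keys differ only in the order of the same field fragments. The Python sketch below reproduces them, assuming the fragments are the first three consonants of the last name, the first three letters of the first name, the street number plus the consonants of the street name, and the first three SSN digits. That layout is inferred from the keys shown; `fragments`, `make_key`, and `KEY_ORDERS` are illustrative names.

```python
VOWELS = set("AEIOU")

def cons(word):
    """Upper-cased consonants of a word, in order."""
    return "".join(c for c in word.upper() if c.isalpha() and c not in VOWELS)

def fragments(rec):
    """Field portions shared by all three keys (layout inferred from the slide)."""
    number, street = rec["address"].split()[:2]
    return {
        "last": cons(rec["last"])[:3],
        "first": rec["first"].upper()[:3],
        "addr": number + cons(street),
        "ssn": rec["ssn"][:3],
    }

KEY_ORDERS = {
    "pass 1 (last name first)": ("last", "first", "addr", "ssn"),
    "pass 2 (first name first)": ("first", "last", "addr", "ssn"),
    "pass 3 (address first)": ("addr", "ssn", "last", "first"),
}

def make_key(rec, order):
    """Concatenate the fragments in the order given for this pass."""
    f = fragments(rec)
    return "".join(f[name] for name in order)

records = [
    {"first": "Kathi", "last": "Kason", "address": "48 North St.", "ssn": "789912345"},
    {"first": "Kathy", "last": "Kason", "address": "48 North St.", "ssn": "879912345"},
    {"first": "Kathy", "last": "Smith", "address": "48 North St.", "ssn": "879912345"},
]

for label, order in KEY_ORDERS.items():
    print(label, sorted(make_key(r, order) for r in records))
```

Each ordering brings a different pair of records next to each other in sorted order, which is why several passes with small windows can find matches that a single key would miss.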
Transitive Equality Example
IF A implies B
AND B implies C
THEN A implies C
From example:
789912345 Kathi Kason 48 North St. (A)
879912345 Kathy Kason 48 North St. (B)
879912345 Kathy Smith 48 North St. (C)
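Computing the transitive closure over the matched pairs amounts to finding connected groups of records. A minimal Python sketch using a union-find structure is shown below; the slides do not state how the authors implemented this step, and `transitive_closure` is an illustrative name.

```python
def transitive_closure(pairs):
    """Group records into clusters given pairs judged equivalent by any pass."""
    parent = {}

    def find(x):
        """Return the representative of x's group, shortening the path as we go."""
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

# A matched B in one pass, and B matched C in another pass.
print(transitive_closure([("A", "B"), ("B", "C")]))
```

Here A and C are never compared directly, yet they end up in the same cluster because both were matched with B.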
Test Results
Test Environment
Test data was created by a database
generator

Names are randomly chosen from a list of 63000
real names
The database generator provides a large
number of parameters:



size of the DB,
percentage of duplicates,
amount of error…
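A toy version of such a generator might look like the Python sketch below. It is not the authors' generator: the parameter names, the single-character corruption model, and the short name list are all assumptions made for illustration.

```python
import random

def corrupt(text, rng):
    """Overwrite one randomly chosen character with a random lowercase letter."""
    if not text:
        return text
    i = rng.randrange(len(text))
    return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + text[i + 1:]

def generate(size, duplicate_fraction, error_rate, names, seed=0):
    """Create `size` original records plus noisy duplicates of some of them.
    The shared "id" field records the ground truth for later evaluation."""
    rng = random.Random(seed)
    originals = [{"id": i, "name": rng.choice(names)} for i in range(size)]
    records = list(originals)
    for rec in originals:
        if rng.random() < duplicate_fraction:
            name = rec["name"]
            if rng.random() < error_rate:
                name = corrupt(name, rng)
            records.append({"id": rec["id"], "name": name})
    rng.shuffle(records)
    return records

data = generate(size=1000, duplicate_fraction=0.3, error_rate=0.5,
                names=["stolfo", "kason", "bonilla", "boardman"])
print(len(data))
```

Because the true duplicates are known, the detected pairs can be scored for missed matches and false positives.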
Correct Duplicate Detection
Time for each run
Accuracy for each run
Real-World Test
Data was obtained from the Office of
Children Administrative Research (OCAR)
of the Department of Social and Health
Services (State of Washington)
OCAR’s goals


How long do children stay in foster care?
How many different homes do children
typically stay in?
OCAR’s Database
Most of OCAR’s data is stored in one
relation
The DB contains 6,000,000 total records
The DB grows by about 50,000 records
per month
Typical Problems in the DB
Names are frequently misspelled
SSN or birthdays are either missing or clearly
wrong
Case number often changes when the child’s
family moves to another part of the state
Some records use service provider names
instead of the child’s
No reliable unique identifier
OCAR Equational Theory
Keys for the independent runs



Last Name, First Name, SSN, Case Number
First Name, Last Name, SSN, Case Number
Case Number, First Name, Last Name, SSN
OCAR Results
Incremental Merge/Purge w/ New
Data
Incremental Merge/Purge
Lists are concatenated for first time
processing
Concatenating new data before reapplying
the merge/purge process may be very
expensive in both time and space
An incremental merge/purge approach is
needed: Prime Representatives method
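The idea can be sketched in Python as follows. This is a deliberately simplified illustration, not the authors' algorithm: it keeps a single representative per cluster and compares each new record directly against every representative instead of running a sorted-neighborhood pass. `incremental_merge`, `choose_pr`, and `is_match` are hypothetical names.

```python
def incremental_merge(clusters, new_records, choose_pr, is_match):
    """Fold new records into existing clusters by comparing them only
    against one representative per cluster, not against every old record."""
    reps = [choose_pr(cluster) for cluster in clusters]
    for rec in new_records:
        for idx, rep in enumerate(reps):
            if is_match(rep, rec):
                clusters[idx].append(rec)
                break
        else:
            # No representative matched: the record starts a new cluster.
            clusters.append([rec])
            reps.append(rec)
    return clusters

existing = [["kason"], ["stolfo"]]
incoming = ["kasen", "adams", "stolfo"]
print(incremental_merge(existing, incoming,
                        choose_pr=lambda cluster: cluster[0],
                        is_match=lambda a, b: a[:3] == b[:3]))
# [['kason', 'kasen'], ['stolfo', 'stolfo'], ['adams']]
```

The saving comes from the old data being summarized by its representatives, so the cost of each increment depends on the number of clusters rather than the number of stored records.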
Prime-Representative: Definition
A set of records extracted from each
cluster of records used to represent the
information in the cluster
The “Cluster Centroid” or base element
of equivalence class
Prime-Representative creation
Initially, no PR exists
After the execution of the first
merge/purge, create clusters of similar
records
Correct selection of PR from cluster
impacts accuracy of results
For some clusters, choosing no PR at all
can be the best selection
3 Strategies for Choosing PR
Random Sample

Select a sample of records at random from
each cluster
N-Latest

Most recent elements entered in DB
Syntactic

Choose the largest or most complete
record
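The three strategies could be sketched in Python as below. The record layout (dictionaries with an `entered` field giving insertion order) and the measure of completeness (number of non-empty fields) are assumptions made for illustration.

```python
import random

def random_sample(cluster, k, rng=random):
    """Random Sample: pick up to k records from the cluster at random."""
    return rng.sample(cluster, min(k, len(cluster)))

def n_latest(cluster, n):
    """N-Latest: the n records most recently entered in the DB."""
    return sorted(cluster, key=lambda r: r["entered"], reverse=True)[:n]

def syntactic(cluster):
    """Syntactic: the record with the most non-empty fields."""
    def completeness(record):
        return sum(1 for value in record.values() if value not in (None, ""))
    return max(cluster, key=completeness)

cluster = [
    {"name": "Kathi Kason", "ssn": "", "entered": 1},
    {"name": "Kathy Kason", "ssn": "879912345", "entered": 2},
    {"name": "Kathy Smith", "ssn": "", "entered": 3},
]
print([r["name"] for r in n_latest(cluster, 2)])  # ['Kathy Smith', 'Kathy Kason']
print(syntactic(cluster)["name"])                 # Kathy Kason
```

Random sampling needs no extra information, N-Latest favors current data, and the syntactic choice favors the record that carries the most information for later comparisons.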
Important Assumption
No data previously used to select each
cluster’s PR will be deleted

Deleted records could require restructuring
of clusters (expensive)
No changes in the rule-set will occur
after the first increment of data is
processed

Substantial rule change could invalidate
clusters.
Results
Cumulative running time for the
Incremental Merge/Purge algorithm is
higher than that of the classic algorithm
PR selection methodology could
improve cumulative running time
Total running time of the Incremental
Merge/Purge algorithm is always
smaller
Conclusion
Cleansing of Data
Sorted-Neighborhood Method is expensive
due to


the sorting phase
the need for large windows for high accuracy
Multiple passes with small windows, followed
by transitive closure, improve accuracy and
improve performance for a given level of accuracy


increasing number of successful matches
decreasing number of false positives
Questions 1?
2 major reasons merging large
databases becomes a difficult problem:


The databases are heterogeneous
The identifiers or strings differ in how they
are represented within each DB
Questions 2?
The 3 steps in SNM are:



Creation of key(s)
Sorting records on this key
Merge/Purge records
Questions 3?
3 strategies for selecting a PR:



Random Sample
N-Latest
Syntactic
The End
Thanks very much!