Data Quality Assurance in
Telecommunications Databases
C.-M. Chen , M. Cochinwala
{chungmin, munir}@research.telcordia.com
Applied Research
Telcordia Technologies
Morristown, NJ 07960
Outline
 Operation Support Systems
 Data Quality Issues
 Record Matching Problem & Techniques
 Example Applications
 References
Operation Support Systems
 Telecom carriers or service providers use multiple
Operation Support Systems (OSS) for
– network configuration
– network engineering
– service provisioning
– network performance monitoring
– customer care & billing, etc.
 Each OSS may maintain its own database using a DBMS
OSS Databases
 The databases may overlap on the network entities they
describe
 The OSSs may use different models/schemas to describe the same entity
Data Quality Issues
 Corrupted data: data do not correctly reflect the
properties/specifications of the modeled entities
 Inaccurate data: discrepancy between the data and the real-world
entity they modeled (e.g. outdated information)
 Inconsistent data:
– records in different OSS databases referencing the same real-world entity
do not agree on the attribute values of the entity
– an entity is represented by more than one record in a single OSS database
Example: DSL Deployment
 corrupted data: many of the Central Offices (CO) do not have correct
location information (longitude,latitude)
 inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa)
 inconsistent data: Service Provisioning database says a line is
available while Network Configuration database says otherwise
(which one to believe?)
Why is data quality an issue?
 corrupted data disrupt/delay business deployment
 inaccurate data impede decision making
 inconsistent data degrade operation efficiency
 bottom line:
– increased cost
– decreased revenue
– customer dissatisfaction
Cost of Data Quality Assurance
 To date, the practice of data quality assurance is mostly manual and labor-intensive
 “ … it costs a typical telephone company in the mid-1990s $2 per line
per year to keep the network database reasonably accurate.”
 Now imagine the scale and complexity of the networks brought by the
IP and wireless technologies and their effects on the cost of data
quality
⇒ (Semi-)automatic tools are needed
Data Quality Assurance Techniques
 data correctness and accuracy:
– database validation and correction via autodiscovery tools
 data consistency (reconciliation)
– record matching (record linkage): identify records that correspond to the
same real-world entity
Data Accuracy
 to assess and correct the discrepancy between data in the database and the modeled entities in the real world
 autodiscovery tools:
– automatically probe network elements for configuration parameters
– reconstruct the network connection topology at different layers (from
physical to application layers)
 how to efficiently validate the data stored in the databases against the
auto-discovered information?
– sampling
– graph algorithms
Record Matching Techniques
[Cochinwala01b]
 Problem Definition
 Record Matching Phases
 Quality Metrics
 Searching Techniques (to reduce search space)
 Matching Decision Rules (to determine matched pairs)
The Record Matching Problem
Goal: To identify records in the same or different databases that
correspond to the same real-world entity.
 two records that correspond to the same entity are called a matched
pair
 in a relational table, no two records can hold identical values on all attributes ⇒ a matched pair consists of two different records that refer to the same entity
 the same entity may be represented differently in different
tables/databases, with no agreement on key value(s)
Three Phases of Reconciliation
 Data Preparation: scrubbing and cleansing
 Searching: reducing search space
 Matching: finding matched pairs
Data Preparation
 Parsing
 Data transformation
 Standardization
Searching Problem
Consider reconciling tables A and B
 search space: P = A × B (Cartesian product)
 M: matched pairs
 U: unmatched pairs
P = M ∪ U
Problem: searching P requires |A| × |B| comparisons
Goal: reduce the search space to a smaller P' ⊆ P, such that M ⊆ P'.
Searching Techniques
Heuristic: potentially matched records should fall into the same cluster (a minimal windowing sketch follows this list)
 Relational join operators: nested-loop join, merge-sort join, hash join, band join [Hernadez96]
 Blocking: Soundex Code, NYSIIS [Newcombe88]
 Windowing (sorted neighborhood) [Hernadez98]
 Priority queue [MongeElkan97]
 Multi-pass merge-sort joins [Hernadez98]
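A minimal sketch of the sorted-neighborhood (windowing) idea behind these techniques: sort records on a blocking key and compare only records that fall inside a sliding window, instead of the full Cartesian product. The key function, the window size, and the toy records below are illustrative assumptions, not part of the original slides.

```python
def blocking_key(record):
    # illustrative key: first four letters of the name plus the ZIP code
    return (record["name"][:4].upper(), record.get("zip", ""))

def sorted_neighborhood_pairs(records, window=5):
    """Yield candidate pairs whose positions in the sorted order differ by less than `window`."""
    ordered = sorted(records, key=blocking_key)
    for i, left in enumerate(ordered):
        for right in ordered[i + 1 : i + window]:
            yield left, right

rows = [
    {"name": "Falcon Cable TV", "zip": "07960"},
    {"name": "Falcon Cablevision", "zip": "07960"},
    {"name": "Chen Chung-Min", "zip": "07932"},
]
for a, b in sorted_neighborhood_pairs(rows, window=2):
    print(a["name"], "<->", b["name"])
```

With tables the size of the Arroyo databases described later in the deck, pruning the candidate set this way is what makes pairwise matching feasible.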
String Matching Techniques
 Edit distance: count of insert, delete, and replace operations [Manber89]; a small dynamic-programming sketch follows this list
 Smith-Waterman: dynamic programming to find minimum edit
distance [SmithWaterman81]
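A compact dynamic-programming edit distance in the spirit of [Manber89]; only the classic unit-cost formulation is shown here.

```python
def edit_distance(s, t):
    """Minimum number of single-character inserts, deletes, and replacements turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))              # distances from "" to prefixes of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # replace (or keep)
        prev = curr
    return prev[n]

print(edit_distance("PO BOX 20", "20 PO BOX"))  # raw edit distance is sensitive to word order
```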
Record Matching Techniques
Given a (reduced) space of pairs, how to determine whether a pair (x, y) is a match, i.e., records x and y refer to the same real-world entity?
 Probabilistic approaches [Newcombe59]
– Bayes test for minimum error
– Bayes test for minimum cost
 Non-probabilistic approaches
– Supervised Learning (to generate decision rules) [Cochinwala01a]
– Equation theory model [Hernadez98]
– Distance-based techniques [DeySarkarDe98]
Bayes Test for Minimum Error
Cartesian product P = A × B = M ∪ U
For each record pair (a1, a2, … , an, b1, b2, … , bn) ∈ P, define
x = (x1, x2, …, xn), where
   xi = 1 if ai = bi
   xi = 0 if ai ≠ bi
A-priori probabilities:
   Class M: πM
   Class U: πU
Conditional density functions:
   p(x|M)
   p(x|U)
Unconditional density function:
   p(x) = πM p(x|M) + πU p(x|U)
Bayes Test for Minimum Error (cont.)
Assume πM, πU, p(x|M), and p(x|U) are known
Bayes theorem: the posterior probability
   p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U))
Likelihood Ratio Decision Rule:
   x is decided to belong to M iff
   p(M|x) ≥ p(U|x) iff
   πM p(x|M) ≥ πU p(x|U) iff
   l(x) = p(x|M) / p(x|U) ≥ πU / πM
The test gives the minimum probability of error (misclassification); a numeric sketch follows.
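An illustrative sketch of the minimum-error rule above. The priors and per-attribute agreement probabilities are made-up numbers, and treating the attribute comparisons as independent (a naive-Bayes style density) is an added assumption, not something the slides prescribe.

```python
PI_M, PI_U = 0.01, 0.99            # a-priori class probabilities (illustrative)
p_agree_M = [0.95, 0.90, 0.85]     # P(x_i = 1 | M) for each compared attribute
p_agree_U = [0.10, 0.05, 0.20]     # P(x_i = 1 | U)

def density(x, p_agree):
    # p(x | class) under an assumed independence of the attribute comparisons
    prob = 1.0
    for xi, p in zip(x, p_agree):
        prob *= p if xi == 1 else (1.0 - p)
    return prob

def classify_min_error(x):
    # decide M iff l(x) = p(x|M) / p(x|U) >= pi_U / pi_M
    l = density(x, p_agree_M) / density(x, p_agree_U)
    return "M" if l >= PI_U / PI_M else "U"

print(classify_min_error((1, 1, 1)))  # strong agreement -> M
print(classify_min_error((1, 0, 0)))  # mostly disagreement -> U
```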
Bayes Test for Minimum Cost
cij: cost of a class-j record being (mis)classified into class i
Conditional cost of classifying x into class M:
   cM(x) = cMM p(M|x) + cMU p(U|x)
Similarly,
   cU(x) = cUM p(M|x) + cUU p(U|x)
Likelihood Ratio Decision Rule:
   x is decided to belong to M iff
   cM(x) ≤ cU(x) iff
   l(x) = p(x|M) / p(x|U) ≥ (cMU - cUU) πU / ((cUM - cMM) πM)
The test gives the minimum expected cost; only the decision threshold changes, as the sketch below illustrates.
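The cost matrix below is illustrative (here a missed match is five times as costly as a false match); the priors are the same made-up values as in the previous sketch.

```python
c_MM, c_MU = 0.0, 1.0    # cost of deciding M when the true class is M / U
c_UM, c_UU = 5.0, 0.0    # cost of deciding U when the true class is M / U
PI_M, PI_U = 0.01, 0.99  # illustrative a-priori class probabilities

threshold_min_error = PI_U / PI_M
threshold_min_cost = ((c_MU - c_UU) * PI_U) / ((c_UM - c_MM) * PI_M)

print(threshold_min_error)  # 99.0
print(threshold_min_cost)   # 19.8 -- a costly miss lowers the bar for declaring a match
```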
Supervised Learning
 Take a small sample S from A × B
 For every record pair s ∈ S, label it as M (matched) or U (unmatched)
 Select a predictive model, along with associated decision rules, that best discriminates between classes M and U for the pairs in S
 Apply the selected model to classify all pairs in A × B (a minimal sketch follows)
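A minimal sketch of this workflow using scikit-learn's CART implementation; the slides name CART as one candidate model but do not prescribe a library, and the feature vectors and labels below are toy values.

```python
from sklearn.tree import DecisionTreeClassifier

# one row per labeled pair from the sample S: [name_edit_dist, addr_edit_dist]
sample_features = [[0, 1], [1, 0], [0, 0], [7, 9], [8, 2], [6, 7]]
sample_labels   = ["M", "M", "M", "U", "U", "U"]   # manual verdicts

model = DecisionTreeClassifier(max_depth=3).fit(sample_features, sample_labels)

# apply the learned decision rules to (the reduced search space of) A x B
print(model.predict([[1, 1], [9, 8]]))
```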
Arroyo: a data reconciliation tool [Cochinwala01a]
 matches customer records across two databases:
– wireline database: 860,000 records
– wireless database: 1,300,000 records
 methodology
– pre-processing (application dependent)
– parameter space definition (application dependent)
– matching rule generation & pruning (learning, model selection)
– matching
Preprocessing (Application Dependent)
 elimination of stop-words
– blanks, special characters, ...
– “Chung-Min Chen” becomes “ChungMin Chen”
 word re-ordering
– “20 PO BOX” becomes “PO BOX 20”
 word substitution
– “St.” becomes “Street”
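A combined sketch of these three steps; the substitution table, the upper-casing, and the crude re-ordering rule are illustrative choices, not the actual Arroyo rules.

```python
import re

SUBSTITUTIONS = {"ST.": "STREET", "AVE.": "AVENUE"}   # illustrative table

def preprocess_name(name):
    # drop special characters inside words: "Chung-Min Chen" -> "CHUNGMIN CHEN"
    words = [re.sub(r"[^A-Za-z0-9]", "", w) for w in name.split()]
    return " ".join(w for w in words if w).upper()

def preprocess_address(addr):
    words = [SUBSTITUTIONS.get(w.upper(), w.upper()) for w in addr.split()]
    # crude re-ordering: move a leading box/house number behind the keywords
    if words and words[0].isdigit() and "BOX" in words:
        words = words[1:] + words[:1]
    return " ".join(words)

print(preprocess_name("Chung-Min Chen"))   # CHUNGMIN CHEN
print(preprocess_address("20 PO BOX"))     # PO BOX 20
print(preprocess_address("12 Main St."))   # 12 MAIN STREET
```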
Parameter Space Definition (Application Dependent)
Six parameters used:
 edit distance between the Name fields
 edit distance between the Address fields
 length of Name field in Wireline
 length of Name field in Wireless
 length of Address field in Wireline
 length of Address field in Wireless
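A small sketch of turning one wireline/wireless record pair into this six-parameter vector. The record layout (dicts with "name" and "address") is assumed for illustration, and edit_distance can be any string distance, e.g. the dynamic-programming version sketched earlier.

```python
def pair_features(wireline, wireless, edit_distance):
    """Return the six Arroyo parameters for one candidate pair."""
    return [
        edit_distance(wireline["name"], wireless["name"]),        # 1: Name edit distance
        edit_distance(wireline["address"], wireless["address"]),  # 2: Address edit distance
        len(wireline["name"]),     # 3: Name length, wireline
        len(wireless["name"]),     # 4: Name length, wireless
        len(wireline["address"]),  # 5: Address length, wireline
        len(wireless["address"]),  # 6: Address length, wireless
    ]
```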
Matching Rule Generation
1. Learning Set Generation
 select records that contain the word “Falcon” in Name field
– Wireless - 241 records
– Wireline - 883 records
 Choose the sample first from the database that is the image of the higher-degree "onto" mapping. For example, Wireline → Wireless has a higher "onto" degree than Wireless → Wireline
Matching Rule Generation
2. Learning Set Labeling
 identify prospective matches
– match if the edit distances on Name and Address fields are both less
than 3
– (30% false matches but no miss of true matches)
 verdict assignment
– each pair that is not a prospective match is assigned "unmatch"
– each prospective-match pair is manually examined and labeled as "match", "unmatch", or "ambiguous" (labor intensive); a sketch of this filter follows
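A sketch of this labeling filter: only pairs that pass the cheap distance test go to manual review, everything else is labeled automatically. The function name is illustrative.

```python
def initial_verdict(name_edit_dist, addr_edit_dist):
    # prospective matches (both edit distances below 3) go to manual review,
    # where they are labeled "match", "unmatch", or "ambiguous"
    if name_edit_dist < 3 and addr_edit_dist < 3:
        return "prospective"
    return "unmatch"

print(initial_verdict(1, 2))   # prospective -> manual review
print(initial_verdict(1, 7))   # unmatch, no manual work needed
```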
Matching Rule Generation
3. Model Selection
 Input: 241 × 883 = 212,803 pairs, each with 7 fields (parameters):
– Label (verdict)
– Edit distance between Name fields
– Length of Name field in Wireless
– Length of Address field in Wireless
– Edit distance between Address fields
– Length of Name field in Wireline
– Length of Address field in Wireline
 Three models were tried, with cross-validation (half of the sample to build the model, half to estimate the error rate); a sketch of the protocol follows the table

Model                          Error Rate (average of 50 runs)
CART                           5.1%
Linear Discriminant Analysis   5.3%
Vector Quantization            9.4%
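A sketch of the split-half cross-validation protocol used for the error rates above: fit on a random half of the labeled sample, score on the other half, and average over repeated runs. scikit-learn's CART is used here as an illustrative model choice.

```python
import random
from sklearn.tree import DecisionTreeClassifier

def estimate_error(features, labels, runs=50):
    """Average error over `runs` random half/half splits of the labeled sample."""
    errors = []
    for _ in range(runs):
        idx = list(range(len(features)))
        random.shuffle(idx)
        half = len(idx) // 2
        train, test = idx[:half], idx[half:]
        model = DecisionTreeClassifier().fit(
            [features[i] for i in train], [labels[i] for i in train])
        predicted = model.predict([features[i] for i in test])
        wrong = sum(p != labels[i] for p, i in zip(predicted, test))
        errors.append(wrong / len(test))
    return sum(errors) / runs
```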
Matching Rule Generation
3. Model Selection (cont.)
Model                          Error Rate (average of 50 runs)
CART                           5.1%
Linear Discriminant Analysis   5.3%
Vector Quantization            9.4%

 LDA could produce rules that are inefficient for database evaluation, e.g.:
   2*(Address edit distance) + 1.3*(Name length on Wireless) < 3 ⇒ match
– an index on Name is useless for such a rule
 CART rules can be evaluated efficiently using existing database indices
Matching Rule Pruning
 determines which parameters (rules) to drop to reduce tree
complexity while retaining tree quality
– maximize delta of a cost function
– dynamic programming with a threshold
 complexity of tree T (a small sketch follows):
   C(T) = Σ_{p∈T} C(p) / Σ_{q∈S} C(q),
   where C(p) = complexity of computing the value of parameter p, and S is the full parameter set
   e.g. C(edit distance between fields A and B) = avg_length(A) * avg_length(B)
   Note: 0 < C(T) ≤ 1
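A small sketch of this complexity measure, using the per-parameter costs from the table on the next slide; the parameter names are made up for readability.

```python
PARAM_COST = {
    "addr_len_wireless": 6.88,
    "addr_len_wireline": 6.72,
    "name_len_wireless": 10.87,
    "name_len_wireline": 10.70,
    "addr_edit_dist": 6.88 * 6.72,     # avg_length(A) * avg_length(B)
    "name_edit_dist": 10.87 * 10.70,
}

def tree_complexity(kept_params):
    # C(T): cost of the parameters kept in the pruned tree, normalized by the full set
    return sum(PARAM_COST[p] for p in kept_params) / sum(PARAM_COST.values())

# the final two-parameter tree keeps Address edit distance and Wireless Name length
print(round(tree_complexity(["addr_edit_dist", "name_len_wireless"]), 4))
```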
Matching Rule Pruning (cont.)
C(edit distance between fields A and B) = avg_length(A)* avg_length(B)
Parameter                    Avg. length   Complexity
Address length (wireless)    6.88          6.88
Address length (wireline)    6.72          6.72
Name length (wireless)       10.87         10.87
Name length (wireline)       10.70         10.70
Edit distance on Addresses                 6.88 * 6.72 = 46.23
Edit distance on Names                     10.87 * 10.7 = 116.31
Matching Rule Pruning (cont.)
 misclassification rate of T:
   M(T) = # misclassified pairs / # pairs in the test set
   Note: 0 ≤ M(T) ≤ 1
 cost of tree T:
   J(T) = w1*C(T) + w2*M(T),
   where w1 and w2 are weights (we used w1 = w2 = 1)
 Let T* = T - {all rules involving parameter p}; define
   ΔJ(T) = J(T) - J(T*)
         = (C(T) - C(T*)) + (M(T) - M(T*))
         = ΔC(T) + ΔM(T)
 Goal: find a set of parameters to drop that maximizes ΔJ(T) (a sketch of the greedy loop follows)
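A sketch of the pruning loop this implies: repeatedly drop the parameter whose removal gives the largest ΔJ, until the best ΔJ falls below a threshold. Here delta_j is assumed to be a callable that re-evaluates the tree without the given parameters and returns J(T) - J(T*); its implementation is application specific.

```python
def greedy_prune(all_params, delta_j, threshold=0.0):
    """Return the list of parameters to drop, chosen greedily by delta_j."""
    dropped = []
    while True:
        candidates = [p for p in all_params if p not in dropped]
        if not candidates:
            break
        best = max(candidates, key=lambda p: delta_j(dropped + [p]))
        if delta_j(dropped + [best]) < threshold:
            break
        dropped.append(best)
    return dropped
```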
Matching Rule Pruning (cont.)
 Dynamic programming
– find a parameter p1 that maximizes ΔJ(T)
– fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T)
– repeat this step until we reach a set of parameters {p1, p2, …, pk} for which the best ΔJ(T) is less than a threshold value
Effect of dropping each single parameter (ΔJ(T) = ΔC(T) + ΔM(T)):

Parameter                    ΔC(T)     ΔM(T)     ΔJ(T)
Edit dist. on Name           0.5882    -0.0140    0.5742
Edit dist. on Address        0.2338    -0.1840    0.0498
Len. on Name (Wireless)      0.0566    -0.0610   -0.0044
Len. on Name (Wireline)      0.0541    -0.0160    0.0381
Len. on Address (Wireless)   0.0347    -0.0081    0.0266
Len. on Address (Wireline)   0.0339    -0.0070    0.0269
Matching Rule Pruning (cont.)
2nd Parameter                ΔC(T)     ΔM(T)     ΔJ(T)
Len. on Name (Wireline)      0.0541    -0.0088    0.0453
Len. on Address (Wireless)   0.0347    -0.0072    0.0275
Len. on Address (Wireline)   0.0339    -0.0091    0.0248
...
Dual parameters pruning
Matching Rule Pruning (cont.)
 Reduce from 6 parameters (original tree) to 2 parameters (final
tree)
root: split on Address edit distance
   Address edit distance < 1.5  →  matched
   Address edit distance > 1.5: split on Wireless Name length
      Wireless Name length < 11.5  →  unmatched
      Wireless Name length > 11.5  →  ambiguous
 an index on the Name field can be used to evaluate the tree (the sketch below writes the tree out as rules)
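Written out as executable rules, following the branch reading of the reconstructed tree above; the 1.5 and 11.5 thresholds are the ones reported on the slide.

```python
def classify(address_edit_distance, wireless_name_length):
    if address_edit_distance < 1.5:
        return "matched"
    if wireless_name_length < 11.5:
        return "unmatched"
    return "ambiguous"

print(classify(1, 8))    # matched
print(classify(4, 8))    # unmatched
print(classify(4, 20))   # ambiguous
```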
CART Matching Result
Prob(predicted as x | given y); rows are the true verdict y, columns the predicted verdict x (each row sums to 1):

y \ x        Matched   Unmatched   Ambiguous
Matched      0.965     0.020       0.015
Unmatched    0.014     0.956       0.030
Ambiguous    0.03      0.18        0.79
 approximately 10% error rate
 reduced the computation cost of matching by 50%
 about 47% more correctly matched records than a leading commercial product
Major Telephone Operating Company
Problem:
– Mergers and acquisitions of wireless companies resulted in the
RBOC’s inability to determine common customers among wireline
and wireless businesses
– Customer databases for each business unit use different schema
and contain many quality problems
– RBOC’s experience with commercial vendor’s data reconciliation
tool was unsatisfactory
 Solution:
– Use small manually-verified data samples (~100 records) to
determine appropriate matching rules
– Use machine learning to prune the rules for efficient analysis of the
large dataset
– Resulted in 30% more correct matches than the commercial tool
Large Media Conglomerate
Problem:
– Company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution
– Reconciliation of wholesaler and retailer databases would make it
easier to track where gaps in reporting are occurring
– Identify ‘bad’ retailers
 Solution:
– Group by primary keys
– Match by secondary keys
– E.g. 3000 C.V.S. Pharmacies are grouped and compared by zip
code and street address – identify ‘bad’ pharmacies
International Government
Problem:
– Reconcile vital taxpayer data from several different sources
– Known problems include record duplication, address mismatches,
address obsolescence, distributed responsibility for database
accuracy and updates
– Identify causes for mistakes
 Solution:
– Improve the process flows and architecture to allow for rapid
modification of pre-processing rules and matching rules
– Detection and classification of likely causes of duplication
– Analysis and improvements reduced number of records that
required manual verification
ILEC-CLEC Billing Reconciliation
 Problem
– ILECs charge CLECs for use of network resources
– Verification of actual usage vs. charging
 E.g. a customer changing providers
– Identify actual usage and send verification to ILEC
– Resource identification in ILEC and CLEC is different
 Solution
– Check charges in bill against actual usage
– Common identification of resources (matching table)
– Solution has only been implemented for access line charge
Future Directions
 Reduction of manual work
– Type based standardization, rules
– Rules and Parameter pruning?
 Database and tool
– Rule execution plan
 E.g. string length before edit distance
– Sharing/maintenance of index across steps in process
– Extending matching to structures (trees, graphs – circuits)
– Matching across layers in networks (OSI layers)
– How often to discover/audit network? – sampling techniques
References
 [Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, "Efficient data reconciliation", Information Sciences, 137 (1-4), 2001.
 [Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, "Record matching: past, present and future", submitted for publication.
 [DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity matching in heterogeneous databases: a distance-based decision model", 31st Hawaii Int. Conf. on System Sciences, 1998.
 [Hernadez96] M. Hernadez, "A generalization of band joins and the merge/purge problem", Ph.D. Thesis, CS Dept., Columbia University, 1996.
 [Hernadez98] M. Hernadez and S. Stolfo, "Real-world data is dirty: data cleansing and the merge/purge problem", Data Mining and Knowledge Discovery, 2 (1), 1998.
 [Manber89] U. Manber, "Introduction to Algorithms: A Creative Approach", Addison-Wesley, 1989.
 [MongeElkan97] A. Monge and C. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records", SIGMOD DMKD Workshop, 1997.
 [Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, "Automatic linkage of vital records", Science, 130 (3381), 1959.
 [Newcombe88] H. Newcombe, "Handbook of Record Linkage", Oxford University Press, 1988.
 [SmithWaterman81] T. Smith and M. Waterman, "Identification of common molecular subsequences", J. Mol. Biol., 147, 1981.