Data Quality Assurance in Telecommunications Databases
C.-M. Chen, M. Cochinwala
{chungmin, munir}@research.telcordia.com
Applied Research, Telcordia Technologies, Morristown, NJ 07960

Outline
– Operation Support Systems
– Data Quality Issues
– Record Matching Problem & Techniques
– Example Applications
– References

Operation Support Systems
Telecom carriers and service providers use multiple Operation Support Systems (OSSs) for
– network configuration
– network engineering
– service provisioning
– network performance monitoring
– customer care & billing, etc.
Each OSS may maintain its own database using a DBMS.

OSS Databases
– The databases may overlap in the network entities they describe.
– The OSSs may use different models/schemas to describe the same entity.

Data Quality Issues
– Corrupted data: data do not correctly reflect the properties/specifications of the modeled entities.
– Inaccurate data: discrepancy between the data and the real-world entity they model (e.g., outdated information).
– Inconsistent data:
  – records in different OSS databases referencing the same real-world entity do not agree on the attribute values of the entity
  – an entity is represented by more than one record in a single OSS database

Example: DSL Deployment
– Corrupted data: many of the Central Offices (COs) do not have correct location information (longitude, latitude).
– Inaccurate data: the database says a port or bandwidth is available at the CO when in fact it is not (or vice versa).
– Inconsistent data: the Service Provisioning database says a line is available while the Network Configuration database says otherwise (which one to believe?).

Why Is Data Quality an Issue?
– Corrupted data disrupt/delay business deployment.
– Inaccurate data impede decision making.
– Inconsistent data degrade operational efficiency.
– Bottom line:
  – increased cost
  – decreased revenue
  – customer dissatisfaction

Cost of Data Quality Assurance
– To date, the practice of data quality assurance is mostly manual and labor-intensive.
– "… it costs a typical telephone company in the mid-1990s $2 per line per year to keep the network database reasonably accurate."
– Now imagine the scale and complexity of the networks brought by IP and wireless technologies, and their effect on the cost of data quality.
– (Semi-)automatic tools are needed.

Data Quality Assurance Techniques
– Data correctness and accuracy:
  – database validation and correction via autodiscovery tools
– Data consistency (reconciliation):
  – record matching (record linkage): identify records that correspond to the same real-world entity

Data Accuracy
– Goal: assess and correct the discrepancy between the data in the database and the modeled entities in the real world.
– Autodiscovery tools:
  – automatically probe network elements for configuration parameters
  – reconstruct the network connection topology at different layers (from the physical to the application layer)
– How to efficiently validate the data stored in the databases against the auto-discovered information? (see the sketch below)
  – sampling
  – graph algorithms
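The sampling idea above can be sketched in a few lines of Python. This is a minimal illustration and is not part of the original slides: `discover_fn` stands in for whatever autodiscovery probe is available, and the 5% threshold in the usage comment is an arbitrary example value.

```python
import random

def estimate_db_error_rate(db_records, discover_fn, sample_size=200, seed=0):
    """Estimate the fraction of database records that disagree with
    auto-discovered network data, using a simple random sample.

    db_records  -- dict mapping an element id to its stored attributes
    discover_fn -- callable that probes the live network element and
                   returns the attributes actually observed (hypothetical)
    """
    random.seed(seed)
    ids = random.sample(list(db_records), min(sample_size, len(db_records)))
    mismatches = 0
    for element_id in ids:
        observed = discover_fn(element_id)      # probe the network element
        if observed != db_records[element_id]:  # any attribute disagrees
            mismatches += 1
    return mismatches / len(ids)

# Example usage (hypothetical tables and probe):
# if estimate_db_error_rate(port_table, probe_port) > 0.05:
#     ...schedule a full audit/correction pass...
```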
Record Matching Techniques [Cochinwala01b]
– Problem Definition
– Record Matching Phases
– Quality Metrics
– Searching Techniques (to reduce the search space)
– Matching Decision Rules (to determine matched pairs)

The Record Matching Problem
Goal: identify records in the same or different databases that correspond to the same real-world entity.
– Two records that correspond to the same entity are called a matched pair.
– In a relational table, no two records can hold identical values on all attributes, so a matched pair consists of two different records that refer to the same entity.
– The same entity may be represented differently in different tables/databases, with no agreement on key value(s).

Three Phases of Reconciliation
– Data Preparation: scrubbing and cleansing
– Searching: reducing the search space
– Matching: finding matched pairs

Data Preparation
– Parsing
– Data transformation
– Standardization

Searching Problem
Consider reconciling tables A and B.
– Search space: P = A × B (Cartesian product)
– M: matched pairs; U: unmatched pairs; P = M ∪ U
– Problem: searching P requires |A|·|B| comparisons.
– Goal: reduce the search space to a smaller P' ⊆ P such that M ⊆ P'.

Searching Techniques
– Heuristic: potentially matched records should fall into the same cluster.
– Relational join operators: nested-loop join, sort-merge join, hash join, band join [Hernadez96]
– Blocking: Soundex code, NYSIIS [Newcombe88]
– Windowing (sorted neighborhood) [Hernadez98]
– Priority queue [MongeElkan97]
– Multi-pass merge-sort joins [Hernadez98]

String Matching Techniques
– Edit distance: counts of insert, delete, replace operations [Manber89]
– Smith-Waterman: dynamic programming to find the minimum edit distance [SmithWaterman81]

Record Matching Techniques
Given a (reduced) space of pairs, how to determine whether a pair (x, y) is a match, i.e., whether records x and y refer to the same real-world entity?
– Probabilistic approaches [Newcombe59]
  – Bayes test for minimum error
  – Bayes test for minimum cost
– Non-probabilistic approaches
  – Supervised learning (to generate decision rules) [Cochinwala01a]
  – Equational theory model [Hernadez98]
  – Distance-based techniques [DeySarkarDe98]

Bayes Test for Minimum Error
– Cartesian product P = A × B = M ∪ U
– For each record pair (a1, a2, …, an, b1, b2, …, bn) ∈ P, define the comparison vector x = (x1, x2, …, xn), where xi = 1 if ai = bi and xi = 0 if ai ≠ bi (see the sketch below).
– A priori probabilities: πM for class M, πU for class U
– Conditional density functions: p(x|M), p(x|U)
– Unconditional density function: p(x) = πM p(x|M) + πU p(x|U)
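A minimal Python sketch of the comparison-vector construction just defined; the field names and record values are made up for illustration.

```python
def comparison_vector(rec_a, rec_b, fields):
    """Binary agreement vector x for a candidate pair, as defined above:
    x_i = 1 if the two records agree on field i, 0 otherwise."""
    return tuple(int(rec_a[f] == rec_b[f]) for f in fields)

# Hypothetical customer records from tables A and B
a = {"name": "Chen, C.",  "street": "445 South St", "city": "Morristown"}
b = {"name": "Chen, C-M", "street": "445 South St", "city": "Morristown"}
x = comparison_vector(a, b, ["name", "street", "city"])   # -> (0, 1, 1)
```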
Bayes Test for Minimum Error (cont.)
– Assume πM, πU, p(x|M), and p(x|U) are known.
– Bayes' theorem gives the posterior probability p(M|x) = πM p(x|M) / (πM p(x|M) + πU p(x|U)).
– Likelihood ratio decision rule: decide x belongs to M iff p(M|x) ≥ p(U|x), iff πM p(x|M) ≥ πU p(x|U), iff l(x) = p(x|M) / p(x|U) ≥ πU / πM.
– The test gives the minimum probability of error (misclassification).

Bayes Test for Minimum Cost
– c_ij: cost of a class-j record being (mis)classified into class i.
– Conditional cost of classifying x into class M, given x: cM(x) = cMM p(M|x) + cMU p(U|x). Similarly, cU(x) = cUM p(M|x) + cUU p(U|x).
– Likelihood ratio decision rule: decide x belongs to M iff cM(x) ≤ cU(x), iff l(x) = p(x|M) / p(x|U) ≥ (cMU - cUU) πU / ((cUM - cMM) πM).
– The test gives the minimum cost.

Supervised Learning
– Take a small sample S from A × B.
– For every pair s ∈ S, label it as M (matched) or U (unmatched).
– Select a predictive model, along with associated decision rules, that best discriminates between classes M and U for the pairs in S.
– Apply the selected model to classify all pairs in A × B.

Arroyo: A Data Reconciliation Tool [Cochinwala01a]
– Matches customer records across two databases:
  – wireline database: 860,000 records
  – wireless database: 1,300,000 records
– Methodology:
  – pre-processing (application dependent)
  – parameter space definition (application dependent)
  – matching rule generation & pruning (learning, model selection)
  – matching

Preprocessing (Application Dependent)
– Elimination of stop-words (blanks, special characters, ...):
  – "Chung-Min Chen" becomes "ChungMin Chen"
– Word re-ordering:
  – "20 PO BOX" becomes "PO BOX 20"
– Word substitution:
  – "St." becomes "Street"

Parameter Space Definition (Application Dependent)
Six parameters used:
– edit distance between the Name fields
– edit distance between the Address fields
– length of the Name field in Wireline
– length of the Name field in Wireless
– length of the Address field in Wireline
– length of the Address field in Wireless

Matching Rule Generation 1. Learning Set Generation
– Select records that contain the word "Falcon" in the Name field:
  – Wireless: 241 records
  – Wireline: 883 records
– Choose the sample from the database that is the image of the mapping with the higher "onto" degree. For example, Wireline → Wireless has a higher "onto" degree than Wireless → Wireline.

Matching Rule Generation 2. Learning Set Labeling
– Identify prospective matches:
  – a pair is a prospective match if the edit distances on the Name and Address fields are both less than 3
  – (this yields about 30% false matches but misses no true matches)
– Verdict assignment:
  – each pair that is not a prospective match is labeled "unmatch"
  – each prospective match pair is manually examined and labeled "match", "unmatch", or "ambiguous" (labor intensive)

Matching Rule Generation 3. Model Selection
– Input: 241 × 883 = 212,803 pairs with 7 fields (parameters):
  – label (verdict)
  – edit distance between the Name fields
  – edit distance between the Address fields
  – length of the Name field in Wireless
  – length of the Name field in Wireline
  – length of the Address field in Wireless
  – length of the Address field in Wireline
– Three models were tried, with cross-validation (half of the sample to build the model, half to estimate the error rate; see the sketch below):

Model                         | Error rate (average of 50 runs)
CART                          | 5.1%
Linear Discriminant Analysis  | 5.3%
Vector Quantization           | 9.4%
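A rough Python sketch of this model-selection step, assuming scikit-learn and pandas are available; the feature column names and the DataFrame layout are hypothetical and not taken from the Arroyo implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def cart_error_rate(pairs: pd.DataFrame, runs: int = 50) -> float:
    """Average CART error rate over repeated 50/50 splits of the labeled
    learning set, as in the cross-validation procedure described above."""
    features = ["name_edit_dist", "addr_edit_dist",
                "name_len_wireless", "name_len_wireline",
                "addr_len_wireless", "addr_len_wireline"]
    errors = []
    for run in range(runs):
        # half the sample builds the tree, half estimates the error rate
        train, test = train_test_split(pairs, test_size=0.5, random_state=run)
        tree = DecisionTreeClassifier().fit(train[features], train["verdict"])
        errors.append(1.0 - tree.score(test[features], test["verdict"]))
    return sum(errors) / len(errors)   # average error over the runs
```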
Matching Rule Generation 3. Model Selection (cont.)
– LDA can produce rules that are inefficient to evaluate against a database, e.g.:
  – 2·(Address edit distance) + 1.3·(Name length in Wireless) < 3 ⇒ match
  – an index on the Name field is useless for such a rule
– CART rules can be evaluated efficiently using existing database indices.

Matching Rule Pruning
– Determines which parameters (rules) to drop, to reduce tree complexity while retaining tree quality:
  – maximize the delta of a cost function
  – dynamic programming with a threshold
– Complexity of tree T: C(T) = ( Σ_{p ∈ T} C(p) ) / ( Σ_{q ∈ S} C(q) ), where S is the full set of parameters in the original tree and C(p) is the complexity of computing the value of parameter p.
  – e.g., C(edit distance between fields A and B) = avg_length(A) · avg_length(B)
– Note: 0 < C(T) ≤ 1.

Matching Rule Pruning (cont.)
C(edit distance between fields A and B) = avg_length(A) · avg_length(B)

Parameter                   | Avg. length | Complexity C(p)
Address length (Wireless)   | 6.88        | 6.88
Address length (Wireline)   | 6.72        | 6.72
Name length (Wireless)      | 10.87       | 10.87
Name length (Wireline)      | 10.70       | 10.70
Edit distance on Addresses  |             | 6.88 × 6.72 = 46.23
Edit distance on Names      |             | 10.87 × 10.70 = 116.31

Matching Rule Pruning (cont.)
– Misclassification rate of T: M(T) = (# misclassified pairs) / (# pairs in the test set). Note: 0 ≤ M(T) ≤ 1.
– Cost of tree T: J(T) = w1·C(T) + w2·M(T), where w1 and w2 are weights (we used w1 = w2 = 1).
– Let T* = T − {all rules involving parameter p}, and define
  ΔJ(T) = J(T) − J(T*) = (C(T) − C(T*)) + (M(T) − M(T*)) = ΔC(T) + ΔM(T).
– Goal: find a set of parameters to drop that maximizes ΔJ(T).

Matching Rule Pruning (cont.)
Dynamic programming:
– find a parameter p1 that maximizes ΔJ(T)
– fix p1 and find a second parameter p2 such that {p1, p2} maximizes ΔJ(T)
– repeat until we reach a set of parameters {p1, p2, …, pk} for which ΔJ(T) falls below a threshold value (see the sketch after the tables below)

Single-parameter pruning:

Parameter                   | ΔC(T)  | ΔM(T)   | ΔJ(T)
Edit dist. on Name          | 0.5882 | -0.0140 | 0.5742
Edit dist. on Address       | 0.2338 | -0.1840 | 0.0498
Name length (Wireless)      | 0.0566 | -0.0610 | -0.0044
Name length (Wireline)      | 0.0541 | -0.0160 | 0.0381
Address length (Wireless)   | 0.0266 | -0.0081 | 0.0347
Address length (Wireline)   | 0.0269 | -0.0070 | 0.0339

Matching Rule Pruning (cont.)
Dual-parameter pruning:

2nd parameter               | ΔC(T)  | ΔM(T)   | ΔJ(T)
Name length (Wireline)      | 0.0541 | -0.0088 | 0.0453
Address length (Wireless)   | 0.0275 | -0.0072 | 0.0347
Address length (Wireline)   | 0.0248 | -0.0091 | 0.0339
...
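The pruning procedure above can be sketched as the greedy loop the slides describe: keep dropping the parameter whose removal gives the largest gain ΔJ = ΔC + ΔM until the best remaining gain falls below the threshold. This is an illustrative Python sketch, not the Arroyo code; `delta_j` is a hypothetical helper that recomputes ΔC and ΔM on the learning set for a candidate drop.

```python
def prune_parameters(params, delta_j, threshold):
    """Greedy pruning loop: repeatedly drop the parameter with the largest
    gain dJ = dC + dM; stop when the best available gain is below threshold.

    delta_j(dropped, candidate) -- assumed helper returning dJ for dropping
        `candidate` on top of the already-dropped set (not shown here).
    """
    dropped = []
    remaining = list(params)
    while remaining:
        gains = {p: delta_j(dropped, p) for p in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < threshold:
            break                      # no further drop is worth its cost
        dropped.append(best)
        remaining.remove(best)
    return dropped                     # parameters removed from the tree
```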
Matching Rule Pruning (cont.)
– Reduced from 6 parameters (original tree) to 2 parameters (final tree).
– Final tree:
  – root: test the Address edit distance
    – Address edit distance < 1.5 → matched
    – Address edit distance > 1.5 → test the Wireless Name length
      – Wireless Name length < 11.5 → unmatched
      – Wireless Name length > 11.5 → ambiguous
– An index on the Name field can be used to evaluate the tree.

CART Matching Result
Prob(predicted as x | given y):

x \ y       | Matched | Unmatched | Ambiguous
Matched     | 0.965   | 0.020     | 0.015
Unmatched   | 0.014   | 0.956     | 0.030
Ambiguous   | 0.03    | 0.18      | 0.79

– approximately 10% error rate
– reduced the computation cost of matching by 50%
– about 47% more correctly matched records than a leading commercial product

Major Telephone Operating Company
Problem:
– Mergers and acquisitions of wireless companies left the RBOC unable to determine the common customers of its wireline and wireless businesses.
– The customer databases of the business units use different schemas and contain many quality problems.
– The RBOC's experience with a commercial vendor's data reconciliation tool was unsatisfactory.
Solution:
– Use small, manually verified data samples (~100 records) to determine appropriate matching rules.
– Use machine learning to prune the rules for efficient analysis of the large dataset.
– Resulted in 30% more correct matches than the commercial tool.

Large Media Conglomerate
Problem:
– The company provides magazines to wholesalers, who in turn provide magazines to retailers for distribution.
– Reconciliation of the wholesaler and retailer databases would make it easier to track where gaps in reporting occur.
– Identify 'bad' retailers.
Solution:
– Group by primary keys, then match by secondary keys.
– E.g., 3000 C.V.S. Pharmacies are grouped and compared by zip code and street address to identify 'bad' pharmacies.

International Government
Problem:
– Reconcile vital taxpayer data from several different sources.
– Known problems include record duplication, address mismatches, address obsolescence, and distributed responsibility for database accuracy and updates.
– Identify the causes of mistakes.
Solution:
– Improve the process flows and architecture to allow rapid modification of pre-processing rules and matching rules.
– Detect and classify the likely causes of duplication.
– Analysis and improvements reduced the number of records that required manual verification.

ILEC-CLEC Billing Reconciliation
Problem:
– ILECs charge CLECs for use of network resources.
– Verification of actual usage vs. charges, e.g., when a customer changes providers.
– Identify actual usage and send verification to the ILEC.
– Resource identification differs between the ILEC and the CLEC.
Solution:
– Check the charges on the bill against actual usage.
– Common identification of resources (matching table).
– The solution has so far been implemented only for access line charges.

Future Directions
– Reduction of manual work
  – type-based standardization rules
  – rule and parameter pruning?
– Database and tool
  – rule execution plan, e.g., check string length before computing edit distance (see the sketch below)
  – sharing/maintenance of indices across steps in the process
  – extending matching to structures (trees, graphs – circuits)
  – matching across layers in networks (OSI layers)
– How often to discover/audit the network? – sampling techniques
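As an illustration of the "rule execution plan" point above (string length before edit distance), here is a small Python sketch: a cheap length filter is evaluated first, and the more expensive dynamic-programming edit distance is computed only for pairs that survive it. The function names and the threshold value are illustrative only.

```python
def edit_distance(s, t):
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete
                            curr[j - 1] + 1,             # insert
                            prev[j - 1] + (cs != ct)))   # replace
        prev = curr
    return prev[-1]

def candidate_match(name_a, name_b, max_dist=2):
    """Cheap test first: if the lengths already differ by more than max_dist,
    the edit distance cannot be within max_dist, so skip the expensive step."""
    if abs(len(name_a) - len(name_b)) > max_dist:
        return False
    return edit_distance(name_a, name_b) <= max_dist
```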
References
[Cochinwala01a] M. Cochinwala, V. Kurien, G. Lalk and D. Shasha, "Efficient data reconciliation", Information Sciences, 137 (1-4), 2001.
[Cochinwala01b] M. Cochinwala, S. Dalal, A. Elmagarmid and V. Verykios, "Record matching: past, present and future", submitted for publication.
[DeySarkarDe98] D. Dey, S. Sarkar and P. De, "Entity matching in heterogeneous databases: a distance-based decision model", 31st Hawaii International Conference on System Sciences, 1998.
[Hernadez96] M. Hernadez, "A generalization of band joins and the merge/purge problem", Ph.D. thesis, Computer Science Department, Columbia University, 1996.
[Hernadez98] M. Hernadez and S. Stolfo, "Real-world data is dirty: data cleansing and the merge/purge problem", Data Mining and Knowledge Discovery, 2 (1), 1998.
[Manber89] U. Manber, "Introduction to Algorithms: A Creative Approach", Addison-Wesley, 1989.
[MongeElkan97] A. Monge and C. Elkan, "An efficient domain-independent algorithm for detecting approximately duplicate database records", SIGMOD DMKD Workshop, 1997.
[Newcombe59] H. Newcombe, J. Kennedy, S. Axford and A. James, "Automatic linkage of vital records", Science, 130 (3381), 1959.
[Newcombe88] H. Newcombe, "Handbook of Record Linkage", Oxford University Press, 1988.
[SmithWaterman81] T. Smith and M. Waterman, "Identification of common molecular subsequences", Journal of Molecular Biology, 147, 1981.