“Good decisions require good data”
Process for Data Quality Assurance at Manitoba Centre for Health Policy (MCHP)
Mahmoud Azimaee
Data Analyst at ICES

Literature and Resources
• CIHI Data Quality Framework (2009 edition)
• UK’s NHS Data Quality Reports
• Handbook on Data Quality Assessment Methods and Tools (European Commission)
• Handbook on Improving Quality by Analysis of Process Variables (European Commission)
• Data fitness (Australian National Statistical Service)

Data Quality at MCHP
1. Data Quality Indicators
2. Rating System
– CIHI Data Quality Framework (2009 edition)
3. Data Quality Report
– UK’s NHS Data Quality Reports
4. Practical Approach
5. Automation
– Cody’s Data Cleaning Techniques Using SAS (by Ron Cody)

Data Quality Indicators and Rating System
• Example:
– Completeness: Rate of missing values for all data elements.
– Consistency: Agreement with registry database.

MCHP Data Quality Framework: Data Quality Assurance
Database Level (In Data Management): Accuracy; Completeness (Missing Values); Correctness (Invalid Codes, Invalid Dates, Out of Range, Outliers and Extreme Observations); Internal Validity; Internal Consistency; Stability Across Time; Linkability; External Validity; Identifying Units of Analysis (Persons, Places, Things, ...); Level of Agreement With the Literature and Available Reports; Timeliness; Time to Acquisition; Time to Release; Currency of Data
Research Level (In a Specific Research Project): Interpretability; Availability and Quality of: Documents, Policies and Procedures, Format Libraries, Metadata, Data Model Diagrams; Accuracy; Completeness; Measurement Error; Level of Bias; Degree of Problems with Consistency; Reliability; Level of Agreement With Other Databases

Data Management Process at MCHP
1. Formulate the Request and Receive the Data
– Check the data sharing agreements
– Liaise with the source agency to acquire available data, data model diagram, data dictionary, documentation about historical changes in data content, format, and structure, and data quality reports
– Prepare the data request letter
– Receive the data and associated documentation
2. Become Familiar with Data Structure and Content
– Review provided documentation
– If required, create a data model for the original data
– If receiving test data, test it and send feedback to the source agency
3. Apply SAS Programs
– Apply normalization or de-normalization as required (normalization can be defined as the practice of optimizing table structures by eliminating redundancy and inconsistent dependency)
– Apply data field and SAS format standards
– Install on SPD server (this includes indexing, sorting and clustering)
– Create metadata
– If there is a problem, liaise with the source agency
4. Evaluate Data Quality
– Test the installed data using a standardized protocol
– Identify solutions to address deficiencies in data quality
– Prepare a data quality report for addition to standard documentation
5. Document Data
– Including original documents, data model diagram, SPDS data dictionary, history, file variations and structural changes, revisions and common problems, and data quality report, where available
6. Release Data to Analyst(s) and Researcher(s)
– Meet with programmer(s) and researcher(s) to present data structure and content

How to Present Data Quality Results?
• CIHI Data Quality Report
• UK’s NHS Data Quality Report
– VODIM Test Analysis Methodology: Valid, Other, Default, Invalid, Missing
• VIMO!
– Valid, Invalid, Missing, Outlier

VIMO Table

(1) I just discovered that the data system we have been working on for the last five years has major data quality problems.
(2) That is why I treat data systems the same way I do sausage: I do not want to know what is inside either one.
(3) Ouch!! That is why I am a vegetarian!
Conversation from: Data Quality and Record Linkage Techniques, Thomas N. Herzog, et al., 2007, Springer

Operational Approaches
• Example 1: Identifying Outliers/Extreme Observations:
1. Standard Deviation (Mean ± 2*SD)
2. Trimmed Standard Deviation (Mean_Trimmed10% ± 2*1.49*SD_Trimmed10%)
3. Interquartile Range (Q1 – k*IQR, Q3 + k*IQR), k = 2.5
– Order statistics for calculating quartiles are very memory intensive, so the P² method is used to approximate the quartiles (QMETHOD=P2 in PROC MEANS) [piecewise-parabolic (P²) algorithm invented by Jain and Chlamtac (1985)]

Operational Approaches
Example 2: Stability Across Time
Based on CIHI guideline:
– Trend analysis is used to examine changes in core data elements over time
– No change across years may also be an indication of a problem if the data is expected to naturally trend upward or downward
– Changes in methodology or inclusion/exclusion criteria should be taken into account to determine whether the observed changes were real or not

Example 2: Stability Across Time (Continued)
• Identify unusual changes
– Outlier analysis
• Outlier analysis requires a model
– How to choose an appropriate model in an automated fashion?
• Fit a series of common models:
– Simple Linear: $Y = \beta_0 + \beta_1 X$
– Quadratic: $Y = \beta_0 + \beta_1 X^2$
– Exponential: $Y = \beta_0 + \beta_1 \exp(X)$
– Logarithmic: $Y = \beta_0 + \beta_1 \log(X)$
– SQRT: $Y = \beta_0 + \beta_1 \sqrt{X}$
– Inverse: $Y = \beta_0 + \beta_1 \frac{1}{X}$
– Negative Exponential: $Y = \beta_0 + \beta_1 \exp(-X)$

Example 2: Stability Across Time (Continued)
• Choose the best model with the minimum MSE
• Re-fit the chosen model on the data
• Do an outlier analysis
– Estimate Studentized residuals for each observation (with the current observation deleted)
• Flag significant observations as potential outliers
• Flag observations with no changes over time
• How about the Small Cell Size Policy (0 < Frequency < 6)?
– Use the actual values in modeling, but flag them and then force them to 3 in the report

Automation
• MCHP’s data repository includes over 65 health and other administrative databases (linkable using a common encrypted individual identifier).
• Most of the databases in the repository are updated annually.
• Designing an automated process became a must!

Automation
• A SAS macro-based application package was developed (16 macros)
– Pre Data Quality Macro (1)
– Main Macros (6)
– Intermediate Macros (9)

Automation (Continued)
[Diagram: Documentation System, GETNOBS Macro, METADATA Macro, INVALID Macro, GETVARLIST Macro, VIMO Macro, POSTMUN Macro, GETFORMAT Macro, OUTLIER Macro]
Special Features:
• Can handle standalone and clustered tables
• Can validate postal and municipal codes

Automation (Continued)
[Diagram: GETNOBS Macro, LINK Macro]

Automation (Continued)
[Diagram: GETNOBS Macro, TREND Macro, MONTHLY Macro, FISCAL Macro]

Automation (Continued)
[Diagram: CONTENT Macro, AGREEMENT Macro, PHINCHECK Macro]
• Checks 3rd and 5th positions of PHINs, which must be 0 and 9
• Compares the distribution of the first position with the corresponding PHINs from registry files

Non-Automated Indicators
• Internal Consistency
• Timeliness

Data Quality Assurance
Database Level (In Data Management): Accuracy; Completeness (Missing Values); Correctness (Invalid Codes, Invalid Dates, Out of Range, Outliers and Extreme Observations) [VIMO Macro]; Internal Validity; Internal Consistency; Stability Across Time; Linkability; External Validity; Identifying Units of Analysis (Persons, Places, Things, ...); Level of Agreement With the Literature and Available Reports; Timeliness; Time to Acquisition; Time to Release
Macros shown alongside these indicators: TREND Macro, LINK Macro, PHINCHECK Macro, AGREEMENT Macro
Research Level (In a Specific Research Project): Interpretability; Availability and Quality of: Documents, Policies and Procedures, Format Libraries, Metadata, Data Model Diagrams; Accuracy; Completeness; Measurement Error; Level of Bias; Degree of Problems with Consistency; Reliability; Level of Agreement With Other Databases

• Data Quality Website

Missing Links!
• Central Format Library
• Metadata Database
• Standardization
– Bad standards are better than no standards at all!

Data Quality As A Science
• Data Quality Algebra
• Data Quality Axioms

Acknowledgment
• Mr. Mark Smith (MCHP Associate Director, Repository)
• Dr. Lisa Lix (Associate Professor at University of Saskatchewan)

CONTACT INFORMATION
Mahmoud Azimaee
Institute for Clinical Evaluative Sciences
Work Phone: (647) 480-4055 (Ex. 3618)
E-mail: mahmoud.azimaee@ices.on.ca
Web: www.dastneveshteha.com
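
Supplementary sketch: outlier fences from Example 1
The three rules listed under Operational Approaches, Example 1 (standard deviation, trimmed standard deviation, and interquartile range with k = 2.5) can be illustrated with a short Python sketch. This is not the MCHP SAS implementation. The function names are hypothetical, "trimmed 10%" is assumed here to mean dropping 10% of observations from each tail, and quartiles are computed exactly with Python's statistics module rather than approximated with the P² method (QMETHOD=P2 in PROC MEANS) that the slides describe.

```python
import statistics


def sd_fence(values, k=2.0):
    """Rule 1: Mean +/- k*SD (k = 2 on the slide)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd


def trimmed_sd_fence(values, trim=0.10, k=2.0, inflate=1.49):
    """Rule 2: trimmed mean +/- k*1.49*trimmed SD.

    Assumption: `trim` is the fraction removed from EACH tail.
    """
    xs = sorted(values)
    cut = int(len(xs) * trim)          # observations dropped per tail
    core = xs[cut:len(xs) - cut]
    mean = statistics.mean(core)
    sd = statistics.stdev(core)
    return mean - k * inflate * sd, mean + k * inflate * sd


def iqr_fence(values, k=2.5):
    """Rule 3: (Q1 - k*IQR, Q3 + k*IQR) with exact quartiles (not P2)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, fence):
    """Return the observations falling outside the (low, high) fence."""
    low, high = fence
    return [v for v in values if v < low or v > high]


if __name__ == "__main__":
    # Made-up illustrative data, not taken from any MCHP database.
    sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 95]
    for rule in (sd_fence, trimmed_sd_fence, iqr_fence):
        print(rule.__name__, flag_outliers(sample, rule(sample)))
```

On very large tables the exact quartile step is the expensive part, which is the motivation the slides give for approximating quartiles with P² instead.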