Process for Data Quality Assurance

advertisement
“Good decisions require good data”
Process for
Data Quality Assurance
at Manitoba Centre for Health Policy (MCHP)
Mahmoud Azimaee
Data Analyst at ICES
Literature and Resources
• CIHI Data Quality Framework, (2009
edition)
• UK’s NHS Data Quality Reports
• Handbook on Data Quality
Assessment Methods and Tools,
(European Commission)
• Handbook on Improving Quality by
Analysis of Process Variables
,
(European Commission)
• Data fitness
(Australian National Statistical Service)
Data Quality at MCHP
1. Data Quality Indicators
2. Rating System
– CIHI Data Quality Framework,
(2009 edition)
3. Data Quality Report
– UK’s NHS Data Quality Reports
4. Practical Approach
5. Automation
– Cody’s Data Cleaning
Techniques Using SAS, (by Ron Cody)
Data Quality Indicators and Rating
System
• Example:
– Completeness: Rate of missing values for all data
elements.
• Consistency : Agreement with registry database.
MCHP Data Quality Framework:
Data Quality Assurance
Database Level
(In Data Management)
Accuracy
Completeness
(Missing
Values)
Correctness
(Invalid codes,
Invalid Dates,
Out of Range,
Outliers and
Extreme
Observations)
Internal
Validity
Internal
Consistency
Stability Across
Time
Linkability
External
Validity
Identifying
Units of
Analysis
(Persons,
Places, Things,
...)
Level of
Agreement
With the
Literature and
available
reports
Timeliness
Time to
Acquisition
Time to Release
Currency of
Data
Research Level
(In a Specific Research Projects)
Interpretability
Availability and
Quality of:
Documents ,
Policies and
Procedures,
Formats
Libraries,
Metadata,
Data Model
Diagrams
Accuracy
Completeness
Measurement
Error
Level of Bias
Degree of
Problems with
Consistency
Reliability
Level of
Agreement
With Other
Databases
Data
Management
Process at
MCHP
1. Formulate the Request and Receive the Data
Check the data
sharing
agreements
Liaise with the source agency to acquire available data, data model diagram,
data dictionary, documentation about historical changes in data content,
format, and structure, data quality reports
Prepare the
data request
letter
Receive the data
and associated
documentation
2. Become Familiar with Data Structure and Content
Review provided documentation
If required, create a data model for the
original data
If receiving test data, test it and send
feedback to the source agency
3. Apply SAS Programs
Apply Normalization or De-normalization as required
Normalization can be defined as the practice of optimizing table
structures by eliminating redundancy and inconsistent dependency
Apply data field
and SAS format
standards
Install on SPD server
(This includes indexing,
sorting and clustering)
Create
Metadata
If there is a problem,
liaise with the source
agency
4. Evaluate Data Quality
Test the installed data using standardized
protocol
Identify solutions to address deficiencies in
data quality
Prepare data quality report for addition to
standard documentation
5. Document Data
Including original documents, data model diagram, SPDS data dictionary, history, file variations and structural changes, revisions and
common problems and data quality report, where available
6. Release Data to Analyst(s) and Researcher(s)
Meet with programmer(s) and researcher(s) to present data structure and content
How to Present Data Quality
Results?
• CIHI Data Quality Report
• UK’s NHS Data Quality Report
– VODIM Test Analysis Methodology
•
•
•
•
•
Valid
Other
Default
Invalid
Missing
•
•
•
•
Valid
Invalid
Missing
Outlier
VIMO!
VIMO Table
(1) I just discovered that the
data system we have been
working on for the last five
years has major data quality
problems.
(2) That is why I treat data
systems the same way I do
sausage – I do not want to know
what is inside either one.
(3) Ouch!! That is why I am a
vegetarian!
Conversation from: Data Quality and Record Linkage Techniques, Thomas N. Herzog, et al. 2007, Springer
Operational Approaches
• Example 1: Identifying Outliers/Extreme
Observations:
1.
2.
Standard Deviation (Mean +/- 2*SD)
Trimmed Standard Deviation
(MeanTrimmed10% +/- 2*1.49*SDTrimmed10%)
3. Interquartile Range
(Q1 – k*IQR , Q3 + k*IQR), k=2.5
– Ordered statistics for calculating quartiles is very
memory intensive
=>> P² method to approximate the quartiles
(Using QMETHOD=P2 in PROC MEANS)
[piecewise-parabolic (P²) algorithm invented by
Jain and Chlamtac (1985)]
Operational Approaches
Example 2: Stability Across Time
Based on CIHI guideline:
– Trend analysis is used to examine changes in
core data elements over time
– No change across years may also be an
indication of a problem if the data is expected to
naturally trend upward or downward
– Changes in methodology or inclusion/exclusion
criteria should be taken into account to determine
whether the observed changes were real or not.
Example 2: Stability Across Time (Continued)
• Identify unusual changes
– Outlier analysis
• Outlier analysis requires a model
– How to choose an appropriate model in an
automated fashion?
• Fit a series of common models:
– Simple Linear: Y=β0 + β 1X
– Quadratic: Y= β 0 + β 1X2
– Exponential: Y= β 0 + β 1exp(X)
– Logarithmic: Y= β 0 + β 1log(X)
– SQRT: Y= β 0 + β 1 𝑥
– Inverse: Y= β 0 + β 1
1
𝑥
– Negative Exponential: Y= β 0 + + β 1Exp(-X)
Example 2: Stability Across Time (Continued)
• Choose the best model with the minimum MSE
• Re-fit the chosen model on the data
• Do an outlier analysis
– Estimate Studentized residuals for each observation (with the
current observation deleted)
• Flag significant observations as potential outliers
• Flag observation with no changes over time
• How about Small Cell Size Policy? (0<Frequency<6)
– Use the actual values in modeling but flag and then
force them to 3 in the report
Automation
• MCHP’s data repository includes over 65
health and other administrative databases,
(linkable using a common encrypted
individual identifier).
• Annual updates for most of the databases
in its repository.
• Designing an automated process became a
must!
Automation
• A SAS Macro based application package
was developed (16 Macros)
– Pre Data Quality Macro (1)
– Main Macros (6)
– Intermediate Macros (9)
Automation
Documentation
System
GETNOBS
Macro
METADATA
MACRO
INVALID
Macro
GETVARLIST
Macro
VIMO
Macro
POSTMUN
Macro
GETFORMAT
Macro
(Continued)
OUTLIER
Macro
Special Features:
• Can handle standalone and Clustered tables
• Can Validate Postal and Municipal codes
Automation
GETNOBS
Macro
LINK
Macro
(Continued)
Automation
GETNOBS
Macro
TREND
Macro
MONTHLY
Macro
FISCAL
Macro
(Continued)
Automation
(Continued)
CONTENT
MACRO
AGREEMENT
MACRO
PHINCHECK
MACRO
•
•
Checks 3rd and 5th positions of PHINs which must be 0 and 9
Compares the distribution of the first position with the corresponding
PHINs from registry files
Non-Automated Indicators
• Internal Consistency
• Timeliness
Data Quality Assurance
Database Level
(In Data Management)
Accuracy
Completeness
(Missing
Values)
Correctness
(Invalid codes,
Invalid Dates,
Out of Range,
Outliers and
Extreme
Observations)
VOMO Macro
Internal
Validity
Internal
Consistency
Stability Across
Time
Linkability
External
Validity
Identifying
Units of
Analysis
(Persons,
Places, Things,
...)
Level of
Agreement
With the
Literature and
available
reports
TREND Macro
LINK Macro
PHINCHECK
Macro
AGREEMENT
Macro
Timeliness
Time to
Acquisition
Time to Release
Research Level
(In a Specific Research Projects)
Interpretability
Availability and
Quality of:
Documents ,
Policies and
Procedures,
Formats
Libraries,
Metadata,
Data Model
Diagrams
Accuracy
Completeness
Measurement
Error
Level of Bias
Degree of
Problems with
Consistency
Reliability
Level of
Agreement
With Other
Databases
• Data Quality Website
Missing Links!
• Central Format Library
• Metadata Database
• Standardization
– Bad standards are better than no standards at all!
Data Quality As A Science
• Data Quality Algebra
• Data Quality Axioms
Acknowledgment
• Mr. Mark Smith (MCHP Associate Director, Repository)
• Dr. Lisa Lix (Associate Professor at University of Saskatchewan)
CONTACT INFORMATION
Mahmoud Azimaee
Institute for Clinical Evaluative Sciences
Work Phone: (647) 480-4055 (Ex. 3618)
E-mail: mahmoud.azimaee@ices.on.ca
Web: www.dastneveshteha.com
Download