Incorporating Single-Site and Network-based Data Quality
Assessment in the SAFTINet Distributed Research Network
Lisa Schilling, MD, MSPH
Department of Medicine
University of Colorado School of Medicine
Colorado Health Outcomes Program
Addressing Variations in Data Quality to Facilitate Multi-Institutional Comparative Effectiveness Research
AcademyHealth 2013 Annual Research Meeting
Monday June 24, 2013 11:30A-1:00P
Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network)
Partners and Collaborators
• University of Colorado School of Medicine
• American Academy of Family Physicians (AAFP)
• Ohio State University (OSU) Department of Biomedical Informatics
• Colorado Community Managed Care Network and the Colorado Associated Community Health Information Enterprise (CCMCN/CACHIE)
• Salud Family Health Centers
• Metro Community Provider Network (MCPN)
• Denver Health and Hospital Authority (DHHA)
• Cherokee Health Systems (CHS)
• Colorado Department of Health Care Policy & Financing (HCPF)
• QED Clinical d/b/a CINA
• Observational Medical Outcomes Partnership (OMOP)
• Recombinant Data Corporation, Inc
• ROSITA: Reusable OMOP and SAFTINet Interface Adaptor
• ROSITA: The only bilingual Muppet
• Converts EHR data into research limited data set
1. Replaces local codes with standardized codes
2. Replaces direct identifiers with random identifiers
3. Supports clear-text and encrypted record linkage
4. Provides data quality metrics
5. Pushes data sets to grid node for distributed queries
Or flat files
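Steps 1 and 2 above can be illustrated with a minimal Python sketch. This is not ROSITA's code: the field names (mrn, local_code), the code map, and the "STD-" codes are hypothetical placeholders chosen for illustration.

```python
import secrets

# Hypothetical local-to-standard code map (placeholder values, not real vocabulary codes).
LOCAL_TO_STANDARD = {"DM2": "STD-0001", "HTN": "STD-0002"}


def to_limited_data_set(records, code_map):
    """Replace local codes with standard codes and direct identifiers with random ones."""
    id_map = {}  # direct identifier -> random identifier, stable within one run
    out = []
    for rec in records:
        mrn = rec["mrn"]
        if mrn not in id_map:
            id_map[mrn] = secrets.token_hex(8)  # random, not derived from the MRN
        out.append({
            "person_id": id_map[mrn],
            # Unmapped local codes become None so they can be counted as a DQ metric.
            "condition_code": code_map.get(rec["local_code"]),
        })
    return out, id_map
```

The returned id_map would stay at the site; only the de-identified rows leave it. Counting the None values in condition_code gives a simple unmapped-code rate.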
Concept mapping
• Profiling and mapping EHR or surrogate EHR data is huge!
– Difficult to validate
– Continuous responsibility
– Low cost, low burden
• Three large data “sources” to assess:
– The “raw” source data from ETL (input)
– The post-processed data in OMOP CDM V4 (ROSITA output)
– Cross-grid comparisons (SAFTINet query portal)
Single site data quality assessment using ROSITA
• ROSITA reporting system based on JasperServer Community Edition
• LZ (landing zone) = data with site-specific values & coding schema
• OMOP = data transformed into CDM V4 format with OMOP concept IDs
Three types of DQ profiling reports:
1. LZ (original data)
2. OMOP (transformed data)
3. LZ-OMOP comparison (what changed during transformation?)
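One simple form of the LZ-OMOP comparison is a per-table record count before and after transformation. The sketch below is an illustration only; it assumes the two databases expose comparable table names, and the function and key names are hypothetical.

```python
def compare_counts(lz_counts, omop_counts):
    """Return per-table record counts in LZ and OMOP plus the difference."""
    report = []
    for table in sorted(set(lz_counts) | set(omop_counts)):
        lz = lz_counts.get(table, 0)
        omop = omop_counts.get(table, 0)
        # A positive "lost" value means records were dropped during transformation.
        report.append({"table": table, "lz": lz, "omop": omop, "lost": lz - omop})
    return report
```

A non-zero difference is a prompt for review, not proof of an error; records can be dropped or merged legitimately during transformation.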
• Rules-based data profiling system originally created by OMOP investigators
• Calculates different DQ stats based on data type
• DQ results are stored in a dedicated table in ROSITA
• DQ results are pushed to the grid node for multi-site queries
SAFTINet OSCAR rules by variable type
Statistic Type
1 – Count (of records)
2 – Mean
3 – Standard Deviation
4 – Minimum
5 – 25th Percentile
6 – Median
7 – 75th Percentile
8 – Maximum
9 – Number of NULL Values
10 – Number of Empty String Values
11 – Count (distinct)
Variable Type
Continuous: Numeric *, Date *, ID *
Categorical
* For these data types, statistic 10 returns the count of empty string values only when the underlying column being analyzed is of type VARCHAR; otherwise it returns 0.
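The eleven statistic types can be sketched for a single column in Python. This is an illustration, not the OSCAR implementation: it computes all eleven regardless of variable type, uses None for SQL NULL, and the percentile method ("inclusive") is an assumption.

```python
import statistics


def profile_column(values):
    """Compute statistic types 1-11 for one column; None stands for SQL NULL."""
    non_null = [v for v in values if v is not None]
    # Empty strings are counted separately and excluded from the numeric statistics.
    nums = [float(v) for v in non_null if v != ""]
    q1, median, q3 = statistics.quantiles(nums, n=4, method="inclusive")
    return {
        1: len(values),                      # count of records
        2: statistics.mean(nums),            # mean
        3: statistics.stdev(nums),           # sample standard deviation
        4: min(nums),                        # minimum
        5: q1,                               # 25th percentile
        6: median,                           # median
        7: q3,                               # 75th percentile
        8: max(nums),                        # maximum
        9: sum(v is None for v in values),   # number of NULL values
        10: sum(v == "" for v in values),    # number of empty string values
        11: len(set(non_null)),              # distinct non-NULL values
    }
```

The function needs at least two numeric values, because the sample standard deviation and the quartiles are undefined for fewer.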
The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the visit_occurrence table:

Column Name          Value
source_schema_name   omop
source_table_name    visit_occurrence
variable_name        visit_occurrence_id
variable_type        1 – Continuous
data_type            3 – ID
variable_formula     visit_occurrence_id

And the following results would be generated for this rule:

source_schema_name  source_table_name  variable_name        variable_type  variable_value  statistic_type  statistic_value
omop                visit_occurrence   visit_occurrence_id  1              NULL            1               564
omop                visit_occurrence   visit_occurrence_id  1              NULL            9               0
omop                visit_occurrence   visit_occurrence_id  1              NULL            10              0
omop                visit_occurrence   visit_occurrence_id  1              NULL            11              564

Stat type: 1 = count (by record), 9 = # of null, 10 = # of empty string, 11 = count distinct
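The four statistics in this example map onto plain SQL aggregates. The sketch below uses Python's built-in sqlite3 as a stand-in database; the function name and the result tuple layout are hypothetical, and this is not how ROSITA itself executes rules.

```python
import sqlite3


def run_id_rule(conn, schema, table, column):
    """Return (schema, table, column, statistic_type, statistic_value) rows for an ID column."""
    # Table and column names are assumed to come from a trusted rule table, never user input.
    queries = {
        1: f"SELECT COUNT(*) FROM {table}",                         # count of records
        9: f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL",  # number of NULL values
        10: f"SELECT COUNT(*) FROM {table} WHERE {column} = ''",    # number of empty strings
        11: f"SELECT COUNT(DISTINCT {column}) FROM {table}",        # distinct values
    }
    rows = []
    for stat_type, sql in queries.items():
        value = conn.execute(sql).fetchone()[0]
        rows.append((schema, table, column, stat_type, value))
    return rows
```

For an ID column with no missing values, statistics 1 and 11 should be equal, as in the example (564 and 564); a mismatch would point to duplicate identifiers.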
• JREPORT001 Landing Zone Key Summary Statistics:
– Summary statistics for the Landing Zone database. Includes, for select fields:
– Number of records in every table
– Summary statistics (mean, minimum/maximum, number of missing) on numeric fields
– Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses)
• JREPORT002 Random records for manual chart validation:
– Selection of records for chart review: randomly selects 25 visit occurrence records (and all associated records from the person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database.
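The random selection for chart review can be sketched as follows. This is a simplified illustration, not the JREPORT002 query: it assumes every associated row carries a visit_occurrence_id key, whereas tables such as person or provider would be joined through their own keys.

```python
import random


def sample_for_chart_review(visits, related, n=25, seed=None):
    """Pick up to n visits at random and collect the rows linked to them."""
    rng = random.Random(seed)  # a seed makes the selection reproducible for auditing
    chosen = rng.sample(visits, min(n, len(visits)))
    ids = {v["visit_occurrence_id"] for v in chosen}
    linked = {
        name: [r for r in rows if r.get("visit_occurrence_id") in ids]
        for name, rows in related.items()
    }
    return chosen, linked
```

Sampling without replacement guarantees 25 distinct visits whenever at least 25 exist.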
• JREPORT003 OMOP Database Key Summary Statistics:
– Summary statistics for OMOP database
– Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx
Care site statistics by visits / by patients (LZ)
Care site statistics by visits / by patients (OMOP)
Network-based DQ comparison queries
(Future work)
1. Implement DQ query in DCQL
2. Submit query to grid
3. Returns DQ statistics across all nodes
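Once each node returns its own DQ statistics, they still have to be combined into network-level values. The sketch below shows one way to pool them; it does not use DCQL or any grid API, and the dictionary keys (count, nulls, mean) are hypothetical names for statistics 1, 9, and 2.

```python
def combine_node_stats(node_stats):
    """Pool per-node count, NULL count, and mean into network-level values."""
    total = sum(s["count"] for s in node_stats.values())
    nulls = sum(s["nulls"] for s in node_stats.values())
    non_null = total - nulls
    # Weight each node's mean by the number of non-NULL values behind it.
    weighted = sum(s["mean"] * (s["count"] - s["nulls"]) for s in node_stats.values())
    return {
        "count": total,
        "nulls": nulls,
        "null_fraction": nulls / total if total else None,
        "mean": weighted / non_null if non_null else None,
    }
```

Counts and means pool exactly in this way; medians and percentiles do not, so a network-level median would need either row-level data or an approximation.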
• A rules-based data quality assessment infrastructure enables standardized DQ measures across all data partners
• Three levels of DQ assessment:
– Landing Zone (raw)
– OMOP (transformed)
– Grid (multi-site)
• Exploring graphical visualizations (in R) as rapid screening
DQ Visualization: Thousands of data values in one glance
Thank you!
Questions?
Lisa.Schilling@ucdenver.edu