Incorporating Single-Site and Network-based Data Quality Assessment in the SAFTINet Distributed Research Network

Lisa Schilling, MD, MSPH

Department of Medicine

University of Colorado School of Medicine

Colorado Health Outcomes Program

Addressing Variations in Data Quality to Facilitate Multi-Institutional Comparative Effectiveness Research

AcademyHealth 2013 Annual Research Meeting

Monday June 24, 2013 11:30A-1:00P

Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network)

Acknowledging Collaborations

Observational Medical Outcomes Partnership

Partners and Collaborators

• University of Colorado School of Medicine

• American Academy of Family Physicians (AAFP)

• Ohio State University (OSU) Department of Biomedical Informatics

• Colorado Community Managed Care Network and the Colorado Associated Community Health Information Enterprise (CCMCN/CACHIE)

• Salud Family Health Centers

• Metro Community Provider Network (MCPN)

• Denver Health and Hospital Authority (DHHA)

• Cherokee Health Systems (CHS)

• Colorado Department of Health Care Policy & Financing (HCPF)

• QED Clinical d/b/a CINA

• Observational Medical Outcomes Partnership (OMOP)

• Recombinant Data Corporation, Inc

SAFTINet DRN Architecture

Partner Network

Why ROSITA?

• ROSITA: Reusable OMOP and SAFTINet Interface Adaptor

• ROSITA: The only bilingual Muppet

• Converts EHR data into research limited data set

1. Replaces local codes with standardized codes

2. Replaces direct identifiers with random identifiers

3. Supports clear-text and encrypted record linkage

4. Provides data quality metrics

5. Pushes data sets to grid node for distributed queries
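
Steps 1 and 2 above can be illustrated with a minimal Python sketch. This is not ROSITA's implementation. The code map, the field names (`mrn`, `local_code`) and the `STD:` codes are hypothetical placeholders chosen for the example.

```python
import secrets

# Hypothetical local-to-standard code map; a real mapping would come
# from a curated vocabulary, not a hard-coded dictionary.
LOCAL_TO_STANDARD = {"GLU-FAST": "STD:0001", "HBA1C": "STD:0002"}


def make_pseudonymizer():
    """Return a function that maps each direct identifier to a stable random ID."""
    assigned = {}

    def pseudonymize(direct_id):
        # The same input always gets the same random ID within one run.
        if direct_id not in assigned:
            assigned[direct_id] = secrets.token_hex(8)
        return assigned[direct_id]

    return pseudonymize


def to_limited_record(record, pseudonymize):
    """Apply steps 1 and 2 to one source record (a dict)."""
    return {
        # Step 2: the direct identifier is replaced by a random identifier.
        "person_id": pseudonymize(record["mrn"]),
        # Step 1: unmapped local codes become None so they can be counted later.
        "concept_code": LOCAL_TO_STANDARD.get(record["local_code"]),
        "value": record["value"],
    }


pseudonymize = make_pseudonymizer()
rows = [
    {"mrn": "A100", "local_code": "HBA1C", "value": 7.2},
    {"mrn": "A100", "local_code": "GLU-FAST", "value": 101},
    {"mrn": "B200", "local_code": "UNKNOWN", "value": 5},
]
limited = [to_limited_record(r, pseudonymize) for r in rows]
```

Keeping unmapped codes as `None` rather than dropping the record makes mapping gaps visible to the data quality metrics in step 4.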

ROSITA: From EHR to CER data

[Diagram labels: EHR or flat files; concept mapping]

Profiling Data

• Profiling and mapping EHR or surrogate EHR data is a huge task!

– Difficult to validate

– Continuous responsibility

– Low cost, low burden

• Three large data “sources” to assess:

– The “raw” source data from ETL (input)

– The post-processed data in OMOP CDM V4 (ROSITA output)

– Cross-grid comparisons (SAFTINet query portal)

Single-site data quality assessment using ROSITA

• ROSITA reporting system based on JasperServer Community Edition

• LZ (landing zone) = data with site-specific values & coding schema

• OMOP = data transformed into CDM V4 format and OMOP concept IDs

Three types of DQ profiling reports:

1. LZ (original data)

2. OMOP (transformed data)

3. LZ-OMOP comparison (what changed during transformation?)
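
One simple way to picture the third report type is a per-table comparison of record counts before and after transformation. The sketch below is an assumption about what such a comparison could look like, not the actual Jasper report logic. The `drug_exposure` counts are made-up illustrative values.

```python
def compare_counts(lz_counts, omop_counts):
    """Return per-table record counts before and after transformation."""
    report = {}
    for table in sorted(set(lz_counts) | set(omop_counts)):
        lz = lz_counts.get(table, 0)
        omop = omop_counts.get(table, 0)
        # A negative difference means records were lost during transformation.
        report[table] = {"lz": lz, "omop": omop, "difference": omop - lz}
    return report


# Illustrative counts only.
lz_counts = {"visit_occurrence": 564, "drug_exposure": 1200}
omop_counts = {"visit_occurrence": 564, "drug_exposure": 1175}
report = compare_counts(lz_counts, omop_counts)
```

A non-zero difference is a prompt for review rather than proof of an error, since some records may be dropped legitimately (for example, when a local code has no standard mapping).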

Data profiling using OMOP OSCAR

• Rules-based data profiling system originally created by OMOP investigators

• Calculates different DQ stats based on data type

• DQ results are stored in a dedicated table in ROSITA

• DQ results are pushed to the grid node for multi-site queries

SAFTINet OSCAR rules by variable type

Statistic Type

1 – Count (of records)
2 – Mean
3 – Standard Deviation
4 – Minimum
5 – 25th Percentile
6 – Median
7 – 75th Percentile
8 – Maximum
9 – Number of NULL Values
10 – Number of Empty String Values *
11 – Count (distinct)

Variable Type

Continuous (Numeric, Date, ID)
Categorical

* This will return the count of empty string values when the underlying column being analyzed is of type VARCHAR. Otherwise it will return 0.
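
As a rough illustration of the eleven statistic types, the sketch below computes them for a single column held in a Python list. It is not OSCAR code. The `numeric` flag is a simplification of the variable-type rules, the percentile method (`inclusive`) and sample standard deviation are assumptions, and excluding NULL and empty values from the distinct count is also an assumption.

```python
import statistics


def profile_column(values, numeric=True):
    """Compute statistics for one column; keys are statistic type codes 1-11."""
    present = [v for v in values if v is not None and v != ""]
    stats = {
        1: len(values),                     # count of records
        9: sum(v is None for v in values),  # number of NULL values
        10: sum(v == "" for v in values),   # number of empty string values
        11: len(set(present)),              # count distinct
    }
    # Distribution statistics need at least two non-missing numeric values.
    if numeric and len(present) >= 2:
        q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
        stats.update({
            2: statistics.mean(present),
            3: statistics.stdev(present),
            4: min(present),
            5: q1,
            6: median,
            7: q3,
            8: max(present),
        })
    return stats


result = profile_column([1, 2, 3, 4, 5, None])
```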

Example – Continuous ID

The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the visit_occurrence table:

Column Name          Value
source_schema_name   omop
source_table_name    visit_occurrence
variable_name        visit_occurrence_id
variable_type        1 – Continuous
data_type            3 – ID
variable_formula     visit_occurrence_id

And the following results would be generated for this rule:

source_schema_name  source_table_name  variable_name        variable_type  variable_value  statistic_type  statistic_value
omop                visit_occurrence   visit_occurrence_id  1              NULL            1               564
omop                visit_occurrence   visit_occurrence_id  1              NULL            9               0
omop                visit_occurrence   visit_occurrence_id  1              NULL            10              0
omop                visit_occurrence   visit_occurrence_id  1              NULL            11              564

Statistic types: 1 = count, 9 = # of null, 10 = # of empty string, 11 = count distinct
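
To make the example concrete, the sketch below applies an equivalent rule to an in-memory SQLite table and produces the same four statistics. It is an illustration under stated assumptions, not OSCAR's SQL. The 564 synthetic rows exist only to reproduce the counts shown above, the `STAT_SQL` templates are hypothetical, and SQLite has no `omop` schema, so the schema name is carried in the rule but not used in the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit_occurrence (visit_occurrence_id INTEGER)")
# Synthetic data: 564 distinct, non-null IDs to mirror the example counts.
conn.executemany(
    "INSERT INTO visit_occurrence VALUES (?)", [(i,) for i in range(1, 565)]
)

rule = {
    "source_schema_name": "omop",
    "source_table_name": "visit_occurrence",
    "variable_name": "visit_occurrence_id",
    "variable_type": 1,  # 1 = Continuous
    "variable_formula": "visit_occurrence_id",
}

# Hypothetical mapping from statistic type code to an SQL aggregate.
STAT_SQL = {
    1: "COUNT(*)",                       # count of records
    9: "COALESCE(SUM({f} IS NULL), 0)",  # number of NULL values
    10: "COALESCE(SUM({f} = ''), 0)",    # number of empty string values
    11: "COUNT(DISTINCT {f})",           # count distinct
}


def run_rule(conn, rule):
    """Return (variable_name, statistic_type, statistic_value) rows for one rule."""
    rows = []
    for stat_type, template in STAT_SQL.items():
        expr = template.format(f=rule["variable_formula"])
        sql = f"SELECT {expr} FROM {rule['source_table_name']}"
        (value,) = conn.execute(sql).fetchone()
        rows.append((rule["variable_name"], stat_type, value))
    return rows


results = run_rule(conn, rule)
```

Table and column names are interpolated into the SQL here only because they come from a trusted rule definition; they should never come from user input.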

OSCAR Results in ROSITA

[Screenshot callouts: variable type; stat type (1 = count by record)]

Jasper Reports

• JREPORT001 Landing Zone Key Summary Statistics:

– Summary statistics for the Landing Zone database, including, for select fields:

– Number of records in every table

– Summary statistics (mean, minimum/maximum, number of missing) on numeric fields

– Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses)

• JREPORT002 Random records for manual chart validation:

– Selection of records for chart review. Randomly selects 25 visit occurrence records (and all associated records from person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database.
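
The sampling step in JREPORT002 can be sketched as follows. This is a simplified illustration, not the report's query: it assumes every related table carries a `visit_occurrence_id`, whereas tables such as person or provider would be joined through their own keys. All data and field values are synthetic.

```python
import random


def sample_for_chart_review(visits, related, k=25, seed=None):
    """Pick k random visits and collect the related rows for each."""
    rng = random.Random(seed)
    # Sampling without replacement, so no visit is selected twice.
    chosen = rng.sample(visits, min(k, len(visits)))
    chosen_ids = {v["visit_occurrence_id"] for v in chosen}
    linked = {
        table: [r for r in rows if r["visit_occurrence_id"] in chosen_ids]
        for table, rows in related.items()
    }
    return chosen, linked


# Synthetic example data.
visits = [{"visit_occurrence_id": i, "person_id": i % 40} for i in range(1, 101)]
related = {
    "drug_exposure": [
        {"visit_occurrence_id": i, "drug": "example"} for i in range(1, 101, 2)
    ],
}
chosen, linked = sample_for_chart_review(visits, related, seed=0)
```

Passing a fixed `seed` makes the selection reproducible, which helps when the same records need to be pulled again for a second reviewer.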

• JREPORT003 OMOP Database Key Summary Statistics:

– Summary statistics for OMOP database

– Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx

ROSITA DQ Reporting

Care site statistics by visits / by patients (LZ)

Care site statistics by visits / by patients (OMOP)

Drug Exposure statistics (LZ)

Network-based DQ comparison queries (Future work)

1. Implement DQ query in DCQL

2. Submit query to grid

3. Return DQ statistics across all nodes
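
Since this is described as future work, the sketch below only illustrates what could happen once per-node statistics come back. It does not use DCQL or any grid API. The node names, the row layout and the values are hypothetical.

```python
def combine_node_stats(node_results):
    """Group values by (variable, statistic_type) so nodes can be compared."""
    combined = {}
    for node, rows in node_results.items():
        for variable, stat_type, value in rows:
            combined.setdefault((variable, stat_type), {})[node] = value
    return combined


def null_fraction(combined, variable):
    """Per-node share of NULL values: statistic 9 divided by statistic 1."""
    counts = combined[(variable, 1)]
    nulls = combined[(variable, 9)]
    # Nodes with zero records are skipped to avoid division by zero.
    return {node: nulls[node] / counts[node] for node in counts if counts[node]}


# Hypothetical results returned by two grid nodes.
node_results = {
    "node_a": [("visit_occurrence_id", 1, 564), ("visit_occurrence_id", 9, 0)],
    "node_b": [("visit_occurrence_id", 1, 200), ("visit_occurrence_id", 9, 10)],
}
combined = combine_node_stats(node_results)
fractions = null_fraction(combined, "visit_occurrence_id")
```

Comparing a ratio such as the NULL fraction, rather than raw counts, keeps sites of very different size on the same scale.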

Conclusions

• A rules-based data quality assessment infrastructure enables standardized DQ measures across all data partners

• Three levels of DQ assessment:

– Landing Zone (raw)

– OMOP (transformed)

– Grid (multi-site)

• Exploring graphical visualizations (in R) as a rapid screening tool

DQ Visualization: Thousands of data values in one glance…

Thank you!

Questions?

Lisa.Schilling@ucdenver.edu
