Michael Kahn, Bill LeBlanc
University of Colorado School of Medicine
Chris Uhrich
Recombinant by Deloitte
OMOP Data Management Workgroup
4-April-2013
Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network)
Why ROSITA?
• ROSITA: Reusable OMOP and SAFTINet Interface Adaptor
• ROSITA: The only bilingual Muppet
• Converts EHR data into research limited data set
1. Replaces local codes with standardized codes
2. Replaces direct identifiers with random identifiers
3. Supports clear-text and encrypted record linkage
4. Provides data quality metrics
5. Pushes data sets to grid node for distributed queries
Transforming EHR Data: What does ROSITA do?
Observational Medical Outcomes Partnership (OMOP)
B: Raw-CDM Comparison
• Need to test ETL process
• Multiple approaches
– Execute Observational Source Characteristics Analysis Report (OSCAR) to systematically generate summary statistics on OMOP common data model, and compare with results from custom programs that produce the same summary statistics on the raw data
– Independent programming and exact replication on large random sample
[Figure: OSCAR run against the CDM to be tested]
D: Data Anomaly Review – GROUCH
Generalized Review of OSCAR Unified Checking (GROUCH) produces a summary data quality report for each database.
GROUCH detects data anomalies:
1. Concept – existence and relative frequency of codes compared to benchmark
• Invalid concepts
• Concepts appear in one source, not in others
• Prevalence in one source is statistically different from others
2. Boundary – suspicious or implausible values
• Dates outside range (e.g. drug end date < drug start date)
• Implausible values (e.g. year of birth > 2010)
• Suspicious data (e.g. days supply > 180)
3. Temporal – patterns over time
• Unstable rates over time
[Figure: OSCAR output for the CDM to be tested is compared with OSCARs of other databases (Source 1 to Source 5 CDMs) used as benchmark]
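The three boundary examples above can be sketched as simple queries. This is an illustrative sketch only, not GROUCH itself: the miniature tables, the sample rows, and the helper name run_boundary_checks are assumptions made for the example, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical miniature tables; column names follow OMOP CDM conventions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table person (person_id integer, year_of_birth integer);
    create table drug_exposure (
        drug_exposure_id integer,
        drug_exposure_start_date text,
        drug_exposure_end_date text,
        days_supply integer
    );
    insert into person values (1, 1975), (2, 2015);
    insert into drug_exposure values
        (10, '2012-01-10', '2012-02-09', 30),
        (11, '2012-03-01', '2012-02-01', 30),
        (12, '2012-04-01', '2012-12-27', 270);
""")

# Each check is a (label, query) pair; the query returns ids of offending rows.
# ISO-8601 date strings compare correctly as text.
BOUNDARY_CHECKS = [
    ("drug end date < drug start date",
     "select drug_exposure_id from drug_exposure "
     "where drug_exposure_end_date < drug_exposure_start_date"),
    ("year of birth > 2010",
     "select person_id from person where year_of_birth > 2010"),
    ("days supply > 180",
     "select drug_exposure_id from drug_exposure where days_supply > 180"),
]

def run_boundary_checks(connection):
    """Return {label: [offending ids]} for every boundary check."""
    return {
        label: [row[0] for row in connection.execute(sql)]
        for label, sql in BOUNDARY_CHECKS
    }

findings = run_boundary_checks(conn)
for label, ids in findings.items():
    print(f"{label}: {ids}")
```

Keeping each check as a labeled query makes it easy to add further boundary rules without changing the runner.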
Initial OSCAR Rules Table
Dictionary: cz_oscar_result
Source: OMOP OSCAR v22 specification 26aug2010.doc
Table 1: OSCAR output dataset dictionary
Variable Name | Variable Description
SOURCE_TABLE_NAME | This contains the name of the table containing the variable being summarized.
VARIABLE_NAME | This contains the name of the variable to be summarized.
VARIABLE_TYPE | The method of summarization. Must be one of the following: 1 = continuous, 2 = categorical
SUMMARY_LEVEL | Level of summary: 1 = Number of Observations, 2 = Number of Patients
VARIABLE_VALUE | For categorical variables being summarized, this contains all of the values of the variable. If the variable being summarized is numeric, this variable will be null.
STATISTIC_TYPE | For numeric variables, there will be a number of statistics calculated. Categorical variables will always have a value of 1. The possible values are: 1 = count, 2 = mean, 3 = standard deviation, 4 = minimum, 5 = 25th percentile, 6 = median, 7 = 75th percentile, 8 = maximum, 9 = number of missing values
STATISTIC_VALUE | This contains the value of the statistic being calculated.
OSCAR Results in ROSITA
Jasper Reports
• JREPORT001:
– Summary statistics for Landing Zone database; includes, for select fields:
– Number of records in every table
– Summary statistics (mean, minimum/maximum, number of missing) on numeric fields
– Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses)
• JREPORT002:
– Selection of records for chart review for CDW002
– Randomly selects 25 visit occurrence records (and all associated records from person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database.
• JREPORT003:
– Summary statistics for OMOP database
– Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx
Jasper Reports
cz_oscar_result is a table of summary statistics
Different types of variables summarize differently:
– Numeric variables: means, medians, min, max, sd
– Character variables: category counts, # missing, # null
– Dates: min, max, missing
SQL Code to read cz_oscar_result
select c.source_table_name, c.variable_name, c.statistic_type, c.statistic_value
from cz.cz_oscar_result as c
where c.source_schema_name = 'lz'
  and c.source_table_name = 'lz_src_visit_occurrence'
  and c.variable_name = 'x_provider_src_value'
  and c.variable_type = 2    -- 2 = categorical
  and c.statistic_type = 1;  -- 1 = count
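As a hedged illustration of how a report might consume this query, the sketch below runs it from Python against an in-memory SQLite stand-in. The sample rows and the helper name category_counts are invented for the example, and the select list adds variable_value so each count can be tied to its category; the real cz_oscar_result table may carry additional columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database named "cz" so that the schema-qualified
# name cz.cz_oscar_result resolves the same way as in the query above.
conn.execute("attach database ':memory:' as cz")
conn.execute("""
    create table cz.cz_oscar_result (
        source_schema_name text,
        source_table_name  text,
        variable_name      text,
        variable_type      integer,
        variable_value     text,
        statistic_type     integer,
        statistic_value    real
    )
""")
# Invented rows, for illustration only.
conn.executemany(
    "insert into cz.cz_oscar_result values (?, ?, ?, ?, ?, ?, ?)",
    [
        ("lz", "lz_src_visit_occurrence", "x_provider_src_value", 2, "DR_A", 1, 12),
        ("lz", "lz_src_visit_occurrence", "x_provider_src_value", 2, "DR_B", 1, 8),
        ("omop", "visit_occurrence", "visit_occurrence_id", 1, None, 1, 20),
    ],
)

def category_counts(connection, schema, table, variable):
    """Return (category value, record count) pairs for one categorical variable."""
    query = """
        select c.variable_value, c.statistic_value
        from cz.cz_oscar_result as c
        where c.source_schema_name = ?
          and c.source_table_name = ?
          and c.variable_name = ?
          and c.variable_type = 2    -- categorical
          and c.statistic_type = 1   -- count
        order by c.variable_value
    """
    return connection.execute(query, (schema, table, variable)).fetchall()

counts = category_counts(conn, "lz", "lz_src_visit_occurrence", "x_provider_src_value")
for value, count in counts:
    print(value, count)
```

Binding the schema, table, and variable names as parameters works here because they are data values in cz_oscar_result, not SQL identifiers.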
Problems with Initial Attempts
As we tried to implement JR1-3, and later JR4-10, we realized that:
• We had to go to the lz and omop source tables more often
• We had to rework the variable matrix and rules table, redefining variable types and summary statistics so that cz_oscar_result would contain what we wanted
• Better to do intensive processing on the back end, storing results in cz_oscar_result
Proposals for Discussion
1. Replace continuous versus categorical with additional variable types or add new data type for continuous variable types
2. Expand the available summary statistics
3. Review table of variable types versus summary statistics
– Good news: No change to OSCAR result table structure!
4. Expand OSCAR rule structure to allow joins
Initial Oscar Variable Map
Table 5: OSCAR Analyses across OMOP Common Data Model (OSCAR Default Statistics)
Columns: OMOP Table | OMOP Field | Summary Level (Observation, Person) | Summarize as (Continuous, Categorical)
OMOP Table: PERSON
OMOP Fields:
care_site_id
day_of_birth
ethnicity_concept_id
ethnicity_source_value
gender_concept_id
gender_source_value
location_id
month_of_birth
person_id
person_source_value
provider_id
race_concept_id
race_source_value
year_of_birth
care_site_id * provider_id
gender_concept_id (cat) * year_of_birth (cont)
Variable Types
Reasons for new variable types:
• To provide end users with a clearer understanding of which statistics are gathered for each data type.
• To not gather “meaningless” statistics for certain variable types, such as assigned IDs.
New Variable Types or Data Types
What is the preferred way to extend the variable types?
A. Keep the existing variable types and add a new data type column to the rules: 1 – Numeric, 2 – Date, 3 – ID
B. Or add new variable types: 3 – Continuous ID, 4 – Continuous Date
Note that the tables and examples which follow currently reflect option A.
Variable Types
Statistic Type | Continuous: Numeric | Continuous: Date | Continuous: ID | Categorical
1 – Count (of records) | Y | Y | Y | Y
2 – Mean | Y | - | - | -
3 – Standard Deviation | Y | - | - | -
4 – Minimum | Y | Y | - | -
5 – 25th Percentile | Y | - | - | -
6 – Median | Y | - | - | -
7 – 75th Percentile | Y | - | - | -
8 – Maximum | Y | Y | - | -
9 – Number of NULL Values | Y | Y | Y | -
10 – Number of Empty String Values | * | * | * | -
11 – Count (distinct) | Y | Y | Y | -
* This will return the count of empty string values when the underlying column being analyzed is of type VARCHAR. Otherwise it will return 0.
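One way to make this matrix usable in code is a lookup from variable type and data type to the statistic codes to gather. The sketch below is an assumption-laden illustration: the constant names and the function statistics_for are hypothetical, the ID set matches the Continuous ID example later in the deck (statistics 1, 9, 10, 11), and the Date set (count, minimum, maximum, NULLs, empty strings, distinct) is a reading of the matrix that should be checked against the original slide.

```python
# Statistic type codes 1-11 as listed in the matrix.
COUNT, MEAN, STD_DEV, MINIMUM, P25, MEDIAN, P75, MAXIMUM = range(1, 9)
NULL_COUNT, EMPTY_STRING_COUNT, DISTINCT_COUNT = 9, 10, 11

CONTINUOUS, CATEGORICAL = 1, 2   # variable_type
NUMERIC, DATE, ID = 1, 2, 3      # data_type (option A)

# Which statistics to gather for each data type of a Continuous variable.
# The DATE entry is an assumption read off the matrix, not a confirmed spec.
STATISTICS_BY_DATA_TYPE = {
    NUMERIC: set(range(COUNT, DISTINCT_COUNT + 1)),
    DATE: {COUNT, MINIMUM, MAXIMUM, NULL_COUNT, EMPTY_STRING_COUNT, DISTINCT_COUNT},
    ID: {COUNT, NULL_COUNT, EMPTY_STRING_COUNT, DISTINCT_COUNT},
}

def statistics_for(variable_type, data_type=None):
    """Return the sorted statistic codes to gather for one rule."""
    if variable_type == CATEGORICAL:
        # data_type is ignored for Categorical variables; only counts are stored.
        return [COUNT]
    return sorted(STATISTICS_BY_DATA_TYPE[data_type])

print(statistics_for(CONTINUOUS, ID))
print(statistics_for(CATEGORICAL))
```

A table-driven lookup like this keeps "meaningless" statistics, such as the mean of an assigned ID, from being computed at all.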
OSCAR Rules Table
• The oscar_rule table is used to control what types of statistics will be gathered for specific table columns and calculated variables
• Where appropriate, the column names in the rules table match those in the results table
OSCAR Rules Table (Page 1)
Column Name | Type | Description
oscar_rule_id | bigint | PK for oscar_rule
source_schema_name | varchar(50) | Name of schema for primary table
source_table_name | varchar(50) | Name of primary table
variable_name | varchar(50) | Name of variable for display/selection purposes, usually column name
variable_type | integer | Type of summary to generate: 1 – Continuous, 2 – Categorical
data_type | integer | Data type to specify how data should be treated for analysis. This value is ignored for Categorical variable types. 1 – Numeric, 2 – Date, 3 – ID
variable_formula | varchar(2000) | Formula used to calculate values, e.g. calculated year or quarter, or to cast a value to another type
variable_description_level_1 | varchar(2000) | Description of level 1 grouping
variable_group_by_level_1 | varchar(50) | Group by variable for level 1
variable_description_level_2 | varchar(2000) | Description of level 2 grouping
variable_group_by_level_2 | varchar(50) | Group by variable for level 2
variable_description_level_3 | varchar(2000) | Description of level 3 grouping
variable_group_by_level_3 | varchar(50) | Group by variable for level 3
variable_description_level_4 | varchar(2000) | Description of level 4 grouping
variable_group_by_level_4 | varchar(50) | Group by variable for level 4
OSCAR Rules Table (Page 2)
Column Name | Type | Description
join_column_name_a | varchar(50) | Name of column used in join from primary table (source_table_name)
join_schema_name_b | varchar(50) | Name of schema for secondary table
join_table_name_b | varchar(50) | Name of secondary table
join_column_name_b | varchar(50) | Name of column used in join to secondary table
join_type_ab | varchar(50) | Join type to use when joining primary to secondary table: 1 – INNER JOIN, 2 – LEFT OUTER JOIN, 3 – RIGHT OUTER JOIN, 4 – FULL OUTER JOIN
join_schema_name_c | varchar(50) | Name of schema for tertiary table
join_table_name_c | varchar(50) | Name of tertiary table
join_column_name_c | varchar(50) | Name of column in tertiary table
join_type_bc | varchar(50) | Join type to use when joining secondary to tertiary table
join_schema_name_d | varchar(50) | Name of schema for quaternary table
join_table_name_d | varchar(50) | Name of quaternary table
join_column_name_d | varchar(50) | Name of column in quaternary table
join_join_type_cd | varchar(50) | Join type to use when joining tertiary to quaternary table
contains_phi | character(1) | Indicates whether or not field contains values which might be considered PHI
OSCAR Rules Table – New Columns
• source_schema_name is required because we profile source data in our Landing Zone (LZ) schema as well as in the OMOP schema
• data_type column is used to determine which statistics to collect for continuous variable types
• variable_formula is used to specify a calculated variable, such as a year value extracted from a date value, or to cast a field’s data type, e.g. VARCHAR source value to DATE type
• join columns are used to permit grouping of data across tables, e.g. to group by care_site_id in a query against the drug_exposure table
• contains_phi is used to flag statistics gathered for a field which may contain PHI. This would not apply to calculated or count statistics, but would apply where actual values are captured.
Example – Continuous ID
The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the visit_occurrence table:
Column Name | Value
source_schema_name | omop
source_table_name | visit_occurrence
variable_name | visit_occurrence_id
variable_type | 1 – Continuous
data_type | 3 – ID
variable_formula | visit_occurrence_id
And the following results would be generated for this rule:
source_schema_name | source_table_name | variable_name | variable_type | variable_value | statistic_type | statistic_value
omop | visit_occurrence | visit_occurrence_id | 1 | NULL | 1 | 564
omop | visit_occurrence | visit_occurrence_id | 1 | NULL | 9 | 0
omop | visit_occurrence | visit_occurrence_id | 1 | NULL | 10 | 0
omop | visit_occurrence | visit_occurrence_id | 1 | NULL | 11 | 564
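The four ID statistics in this example (1 = count, 9 = NULL values, 10 = empty strings, 11 = distinct count) can be gathered in a single pass. The following is a minimal sketch under stated assumptions: SQLite stands in for the real database, the 564 generated rows are synthetic, and id_statistics is a hypothetical helper, not ROSITA code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table visit_occurrence (visit_occurrence_id integer)")
# 564 synthetic rows, chosen to mirror the record count in the example.
conn.executemany(
    "insert into visit_occurrence values (?)",
    [(i,) for i in range(1, 565)],
)

def id_statistics(connection, table, column):
    """Return {statistic_type: statistic_value} for an ID column.

    table and column are interpolated into the SQL text, so they must come
    from a trusted rules table; identifiers cannot be bound as parameters.
    """
    sql = f"""
        select count(1),
               sum(case when {column} is null then 1 else 0 end),
               sum(case when typeof({column}) = 'text' and {column} = ''
                        then 1 else 0 end),
               count(distinct {column})
        from {table}
    """
    total, nulls, empties, distinct = connection.execute(sql).fetchone()
    return {1: total, 9: nulls, 10: empties, 11: distinct}

stats = id_statistics(conn, "visit_occurrence", "visit_occurrence_id")
for statistic_type, statistic_value in stats.items():
    print(statistic_type, statistic_value)
```

The typeof guard mirrors the footnote on statistic 10: empty strings are only counted for text values, so a numeric column reports 0.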
Example – Continuous ID w/ Group and Join
The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the drug_exposure table grouped by care_site_id which is obtained from the joined visit_occurrence table:
Column Name | Value
source_schema_name | omop
source_table_name | drug_exposure
variable_name | a.visit_occurrence_id
variable_type | 1 – Continuous
data_type | 3 – ID
variable_formula | visit_occurrence_id
variable_description_level_1 | care_site_id
variable_group_by_level_1 | b.care_site_id
join_column_name_a | a.visit_occurrence_id
join_schema_name_b | omop
join_table_name_b | visit_occurrence
join_column_name_b | b.visit_occurrence_id
join_type_ab | 2 – LEFT OUTER JOIN
Example – Continuous ID w/ Group and Join
And the following results would be generated for this rule:
source_schema_name | source_table_name | variable_name | variable_type | variable_value | statistic_type | statistic_value | variable_description_level_1 | variable_value_level_1
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 1 | 280 | care_site_id | 1
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 1 | 284 | care_site_id | 2
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 9 | 0 | care_site_id | 1
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 9 | 0 | care_site_id | 2
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 10 | 0 | care_site_id | 1
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 10 | 0 | care_site_id | 2
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 11 | 280 | care_site_id | 1
omop | drug_exposure | visit_occurrence_id | 1 | NULL | 11 | 284 | care_site_id | 2
Example of the query created to gather the record counts:
select 'omop' source_schema_name, 'drug_exposure' source_table_name,
  'drug_type_concept_id' variable_name, 1 variable_type, null variable_value,
  1 statistic_type, count(1) statistic_value,
  'care_site_id' variable_description_level_1, b.care_site_id variable_value_level_1
from
  omop.drug_exposure a
  left outer join omop.visit_occurrence b on a.visit_occurrence_id = b.visit_occurrence_id
group by
  b.care_site_id;
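To show how a rule row could drive a query of this shape, here is a hedged sketch of a generator for the record-count statistic. It is not the ROSITA implementation: build_count_query, the rule dictionary, and the tiny sample tables are assumptions for illustration, only one join (a to b) and one grouping level are handled, and SQLite with an attached "omop" database stands in for the real schema.

```python
import sqlite3

# join_type_ab codes as listed in the rules table.
JOIN_TYPES = {1: "inner join", 2: "left outer join",
              3: "right outer join", 4: "full outer join"}

def build_count_query(rule):
    """Build the record-count (statistic_type 1) query for one rule.

    Rule values are interpolated into the SQL text, so they must come from
    a trusted rules table; identifiers cannot be bound as parameters.
    """
    columns = [
        f"'{rule['source_schema_name']}' source_schema_name",
        f"'{rule['source_table_name']}' source_table_name",
        f"'{rule['variable_name']}' variable_name",
        f"{rule['variable_type']} variable_type",
        "null variable_value",
        "1 statistic_type",
        "count(1) statistic_value",
    ]
    source = f"{rule['source_schema_name']}.{rule['source_table_name']} a"
    if rule.get("join_table_name_b"):
        source += (
            f" {JOIN_TYPES[rule['join_type_ab']]}"
            f" {rule['join_schema_name_b']}.{rule['join_table_name_b']} b"
            f" on {rule['join_column_name_a']} = {rule['join_column_name_b']}"
        )
    group_by = rule.get("variable_group_by_level_1")
    if group_by:
        columns.append(
            f"'{rule['variable_description_level_1']}' variable_description_level_1")
        columns.append(f"{group_by} variable_value_level_1")
    sql = "select " + ", ".join(columns) + " from " + source
    if group_by:
        sql += " group by " + group_by
    return sql

rule = {
    "source_schema_name": "omop", "source_table_name": "drug_exposure",
    "variable_name": "visit_occurrence_id", "variable_type": 1,
    "variable_description_level_1": "care_site_id",
    "variable_group_by_level_1": "b.care_site_id",
    "join_column_name_a": "a.visit_occurrence_id",
    "join_schema_name_b": "omop", "join_table_name_b": "visit_occurrence",
    "join_column_name_b": "b.visit_occurrence_id", "join_type_ab": 2,
}

conn = sqlite3.connect(":memory:")
conn.execute("attach database ':memory:' as omop")
conn.executescript("""
    create table omop.visit_occurrence (visit_occurrence_id integer, care_site_id integer);
    create table omop.drug_exposure (drug_exposure_id integer, visit_occurrence_id integer);
    insert into omop.visit_occurrence values (100, 1), (101, 2);
    insert into omop.drug_exposure values (1, 100), (2, 100), (3, 101);
""")
rows = conn.execute(build_count_query(rule)).fetchall()
counts_by_site = {row[-1]: row[6] for row in rows}  # care_site_id -> record count
print(counts_by_site)
```

Generating the SQL from the rule keeps the intensive work on the back end, with each returned row already shaped like a cz_oscar_result record.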
Example – Categorical w/ Variable Formula
The following rule would be used to gather Categorical statistics for the calculated visit_start_year field in the visit_occurrence table grouped by care_site_id:
Column Name | Value
source_schema_name | omop
source_table_name | visit_occurrence
variable_name | visit_start_year
variable_type | 2 – Categorical
variable_formula | extract(year from (cast(visit_start_date as date)))
variable_description_level_1 | care_site_id
variable_group_by_level_1 | care_site_id
And the following results would be generated for this rule:
source_schema_name | source_table_name | variable_name | variable_type | variable_value | statistic_type | statistic_value | variable_description_level_1 | variable_value_level_1
omop | visit_occurrence | visit_start_year | 2 | 2011 | 1 | 139 | care_site_id | 1
omop | visit_occurrence | visit_start_year | 2 | 2012 | 1 | 141 | care_site_id | 2
omop | visit_occurrence | visit_start_year | 2 | 2011 | 1 | 140 | care_site_id | 1
omop | visit_occurrence | visit_start_year | 2 | 2012 | 1 | 144 | care_site_id | 2
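A small sketch of how a variable_formula might be applied for a Categorical rule follows. It assumes SQLite as a stand-in database, which has no extract(year from ...) function, so strftime('%Y', ...) is swapped in as the formula; the six sample visits are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table visit_occurrence (
        visit_occurrence_id integer,
        visit_start_date text,
        care_site_id integer
    )
""")
# Invented sample visits.
conn.executemany("insert into visit_occurrence values (?, ?, ?)", [
    (1, "2011-03-14", 1),
    (2, "2011-07-02", 1),
    (3, "2012-01-20", 1),
    (4, "2011-11-05", 2),
    (5, "2012-05-09", 2),
    (6, "2012-08-30", 2),
])

# SQLite stand-in for: extract(year from (cast(visit_start_date as date)))
VARIABLE_FORMULA = "cast(strftime('%Y', visit_start_date) as integer)"

# The formula is used both as the category value and as a grouping key,
# with care_site_id as the level 1 grouping.
sql = f"""
    select {VARIABLE_FORMULA} as variable_value,
           count(1) as statistic_value,
           care_site_id as variable_value_level_1
    from visit_occurrence
    group by {VARIABLE_FORMULA}, care_site_id
    order by care_site_id, variable_value
"""
rows = conn.execute(sql).fetchall()
for variable_value, statistic_value, care_site in rows:
    print(variable_value, statistic_value, care_site)
```

Each returned row corresponds to one category count (statistic_type 1) for one care site.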
Derived Statistics
• Query against the OSCAR results to gather additional statistics for Categorical variable type:
– Count (distinct) – query number of results for variable
– Number of NULL – query statistic_value where variable_value is NULL
– Number of empty string – query statistic_value where variable_value = ''
– Occurrence counts – query count of statistic_value grouped by statistic_value
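The four derived statistics above can be computed directly from the stored category counts. The sketch below is illustrative only: the sample rows and the function name derived_statistics are assumptions, and each input tuple stands for one cz_oscar_result row (variable_value, statistic_value) with statistic_type 1 for a single categorical variable.

```python
from collections import Counter

# Invented (variable_value, statistic_value) rows for one categorical variable.
results = [("A", 10), ("B", 4), (None, 3), ("", 2), ("C", 4)]

def derived_statistics(rows):
    """Derive extra statistics for a Categorical variable from its count rows."""
    return {
        # Number of result rows for the variable; note that NULL and the
        # empty string each count as a category here.
        "count_distinct": len(rows),
        # statistic_value of the row whose variable_value is NULL.
        "number_null": sum(n for value, n in rows if value is None),
        # statistic_value of the row whose variable_value is ''.
        "number_empty_string": sum(n for value, n in rows if value == ""),
        # How many categories share each statistic_value.
        "occurrence_counts": dict(Counter(n for _, n in rows)),
    }

derived = derived_statistics(results)
for name, value in derived.items():
    print(name, value)
```

Because these values come from rows already in the results table, no further pass over the source data is needed.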