Michael Kahn, Bill LeBlanc University of Colorado School of Medicine Chris Uhrich Recombinant by Deloitte OMOP Data Management Workgroup 4-April-2013 Funding provided by AHRQ 1R01HS019908 (Scalable Architecture for Federated Translational Inquiries Network) Grid Portal ROSITA-GRID-PORTAL Why ROSITA? • ROSITA: Reusable OMOP and SAFTINet Interface Adaptor • ROSITA: The only bilingual Muppet • Converts EHR data into research limited data set 1. 2. 3. 4. 5. Replaces local codes with standardized codes Replaces direct identifiers with random identifiers Supports clear-text and encrypted record linkage Provides data quality metrics Pushes data sets to grid node for distributed queries Transforming EHR Data: What does ROSITA do? OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP B: Raw-CDM Comparison • Need to test ETL process • Multiple approaches – Execute Observational Source Characteristics Analysis Report (OSCAR) to systematically generate summary statistics on OMOP common data model, and compare with results from custom programs that produce the same summary statistics on the raw data – Independent programming and exact replication on large random sample CDM to be tested OSCAR OBSERVATIONAL MEDICAL OUTCOMES PARTNERSHIP D: Data Anomaly Review – GROUCH Generalized Review of OSCAR Unified Checking (GROUCH) produces a summary data quality report each database: CDM to be tested GROUCH detects data anomalies: 1. Concept – existence and relative frequency of codes compared to benchmark OSCAR • • • Source 1 CDM Source 1 CDM Source 2 CDM Source 3 CDM Source 4 CDM Source 5 CDM Invalid concepts Concepts appear in one source, not in others Prevalence in one source is statistically different from others 2. Boundary – suspicious or implausible values • • • OSCARs of other databases for benchmark Dates outside range (e.g. drug end date < drug start date) Implausible values (e.g. year of birth > 2010) Suspicious data (e.g. days supply > 180) 3. Temporal – patterns over time • Unstable rates over time Initial OSCAR Rules Table Dictionary: cz_oscar_result Source: OMOP OSCAR v22 specification 26aug2010.doc Table 1: OSCAR output dataset dictionary Variable Name Variable Description SOURCE_TABLE_NAME This contains the name of the table containing the variable being summarized. This contains the name of the variable to be summarized. The method of summarization. Must be one of the following : 1 = continuous 2 = categorical Level of summary 1 = Number of Observations 2 = Number of Patients For categorical variables being summarized, this contains all of the values of the variable. If the variable being summarized as numeric, this variable will be null. For numeric variables, there will be a number of statistics calculated. Categorical variables will always have a value of 1. The list below contains the possible values : 1 = count 2 = mean 3 = standard deviation 4 = minimum 5 = 25th percentile 6 = median 7 = 75th percentile 8 = maximum 9 = number of missing values This contains the value of the statistic being calculated. VARIABLE_NAME VARIABLE_TYPE SUMMARY_LEVEL VARIABLE_VALUE STATISTIC_TYPE STATISTIC VALUE OSCAR Results in ROSITA Jasper Reports • JREPORT001: – – – – • JREPORT002: – – • Summary statistics for Landing Zone database; Includes, for select fields: Number of records in every table Summary statistics (mean, minimum/maximum, number of missing) on numeric fields Summary statistics (frequency) on categorical/character fields (excluding direct identifiers, such as Social Security Number, Medical Record Number, names, and addresses) Selection of records for chart review for CDW002 Randomly selects 25 visit occurrence records (and all associated records from person, provider, care site, organization, drug exposure, procedure occurrence, condition occurrence, and observation tables) from the ROSITA landing zone database. JREPORT003: – Summary statistics for OMOP database – Source: SAFTINet Data Validation V1.0 2012 Nov 20.docx Jasper Reports cz_oscar_result is a table of summary statistics Different types of variables summarize differently: – Numeric variables: means, medians, min, max, sd – Character variables: category counts, # missing, #null – Dates: min, max, missing SQL Code to read cz_oscar_result • select c.source_table_name, c.variable_name, c.statistic_type, c.statistic_value from cz.cz_oscar_result as c where c.source_schema_name = 'lz' and c.source_table_name = 'lz_src_visit_occurrence' and c.variable_name = 'x_provider_src_value' and variable_type =2 and statistic_type =1 Problems with Initial Attempts As we tried to implement JR1-3, and later JR4-10 we realized that: • We had to go to the lz and omop source tables more often, • We had to rework the variable matrix and rules table, redefining variable types and summary statistics so that cz_oscar_result would contain what we wanted • Better to do intensive processing on the back end, storing results in cz_oscar_result Proposals for Discussion 1. Replace continuous versus categorical with additional variable types or add new data type for continuous variable types 2. Expand the available summary statistics 3. Review table of variable types versus summary statistics – Good news: No change to OSCAR result table structure! 4. Expand OSCAR rule structure to allow joins Initial Oscar Variable Map Table 5: OSCAR Analyses across OMOP Common Data Model OSCAR Default Statistics OMOP Table PERSON OMOP Field care_site_id day_of_birth ethnicity_concept_id ethnicity_source_value gender_concept_id gender_source_value location_id month_of_birth person_id person_source_value provider_id race_concept_id race_source_value year_of_birth care_site_id * provider_id gender_concept_id (cat) * year_of_birth^(cont) Summary Level Observation Person Summarize as Continuous Categorical Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Variable Types Reasons for new variable types: • To provide end users with a clearer understanding of which statistics are gathered for each data type. • To not gather “meaningless” statistics for certain variable types, such as assigned IDs. New Variable Types or Data Types What is the preferred way to extend the variable types? A. Keep the existing variable types and add a new data type column to the rules: 1 – Numeric, 2 – Date, 3 – ID B. Or add new variable types: 3 – Continuous ID, 4 – Continuous Date Note that the tables and examples which follow currently reflect option A. Variable Types Statistic Type 1 – Count (of records) 2 – Mean 3 – Standard Deviation 4 – Minimum 5 – 25th Percentile 6 – Median 7 – 75th Percentile 8 – Maximum 9 – Number of NULL Values 10 – Number of Empty String Values 11 – Count (distinct) Numeric * Variable Type Continuous Date ID Categorical * * * This will return the count of empty string values when the underlying column being analyzed is of type VARCHAR. Otherwise it will return 0. OSCAR Rules Table • The oscar_rule table is used to control what types of statistics will be gathered for specific table columns and calculated variables • Where appropriate, the column names in the rules table match those in the results table OSCAR Rules Table (Page 1) Column Name oscar_rule_id source_schema_name source_table_name variable_name Type bigint varchar(50) varchar(50) varchar(50) variable_type integer data_type integer variable_formula varchar(2000) variable_description_level_1 variable_group_by_level_1 variable_description_level_2 variable_group_by_level_2 variable_description_level_3 variable_groub_by_level_3 variable_description_level_4 variable_group_by_level_4 varchar(2000) varchar(50) varchar(2000) varchar(50) varchar(2000) varchar(50) varchar(2000) varchar(50) Description PK for oscar_rule Name of schema for primary table Name of primary table Name of variable for display/selection purposes, usually column name Type of summary to generate: 1 – Continuous 2 – Categorical Date type to specify how data should be treated for analysis. This value is ignored for Categorical variable types. 1 – Numeric 2 – Date 3 – ID Formula used to calculate values, e.g. calculated year or quarter or to cast a value to a another type Description of level 1 grouping Group by variable for level 1 Description of level 2 grouping Group by variable for level 2 Description of level 3 grouping Group by variable for level 3 Description of level 4 grouping Group by variable for level 4 OSCAR Rules Table (Page 2) Column Name join_column_name_a join_schema_name_b join_table_name_b join_column_name_b join_type_ab Type varchar(50) varchar(50) varchar(50) varchar(50) varchar(50) join_schema_name_c join_table_name_c join_column_name_c join_type_bc join_schema_name_d join_table_name_d join_column_name_d join_join_type_cd contains_phi varchar(50) varchar(50) varchar(50) varchar(50) varchar(50) varchar(50) varchar(50) varchar(50) character(1) Description Name of column used in join from primary table (source_table_name) Name of schema for secondary table Name of secondary table Name of column used in join to secondary table Join type to use when joining primary to secondary table: 1 – INNER JOIN 2 – LEFT OUTER JOIN 3 – RIGHT OUTER JOIN 4 – FULL OUTER JOIN Name of schema for tertiary table Name of tertiary table Name of column in tertiary table Join type to use when joining secondary to tertiary table Name of schema for quaternary table Name of quaternary table Name of column in quaternary table Join type to use when joining tertiary to quaternary table Indicates whether or not field contains values which might be considered PHI OSCAR Rules Table – New Columns • source_schema_name is required because we profile source data in our Landing Zone (LZ) schema as well as in the OMOP schema • data_type column is used to determine which statistics to collect for continuous variable types • variable_formula is used to specify a calculated variable, such as an year value extracted from a date value, or to cast a field’s data type, e.g. VARCHAR source value to DATE type • join columns are used to permit grouping of data across tables, e.g. to group by care_site_id in a query against the drug_exposure table • contains_phi is used to flag statistics gathered for a field which may contain PHI. This would not apply to calculated or count statistics, but would apply where actual values are captured. Example – Continuous ID The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the visit_occurence table: Column Name source_schema_name source_table_name variable_name variable_type data_type variable_formula Value omop visit_occurrence visit_occurrence_id 1 – Continuous 3 – ID visit_occurrence_id And the following results would be generated for this rule: source_schema_ name source_table_ name variable_name omop omop omop omop visit_occurrence visit_occurrence visit_occurrence visit_occurrence visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id variable_ variable_ type value 1 1 1 1 NULL NULL NULL NULL statistic_ type 1 9 10 11 statistic_ value 564 0 0 564 Example – Continuous ID w/ Group and Join The following rule would be used to gather Continuous statistics for the visit_occurrence_id field in the drug_exposure table grouped by care_site_id which is obtained from the joined visit_occurrence table: Column Name source_schema_name source_table_name variable_name variable_type data_type variable_formula variable_description_level_1 variable_group_by_level_1 join_column_name_a join_schema_name_b join_table_name_b join_column_name_b join_type_ab Value omop drug_exposure a.visit_occurrence_id 1 – Continuous 3 – ID visit_occurrence_id care_site_id b.care_site_id a.visit_occurrence_id omop visit_occurrence b.visit_occurrence_id 2 – LEFT OUTER JOIN Example – Continuous ID w/ Group and Join And the following results would be generated for this rule: source_ schema_ name omop omop omop omop omop omop omop omop source_ table_ name drug_exposure drug_exposure drug_exposure drug_exposure drug_exposure drug_exposure drug_exposure drug_exposure variable_name visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id visit_occurrence_id variable_ type variable_ value 1 1 1 1 1 1 1 1 NULL NULL NULL NULL NULL NULL NULL NULL statistic_ type statistic_ value 1 1 9 9 10 10 11 11 280 284 0 0 0 0 280 284 variable_ variable_ description_ value_ level_1 level_1 care_site_id 1 care_site_id 2 care_site_id 1 care_site_id 2 care_site_id 1 care_site_id 2 care_site_id 1 care_site_id 2 Example of the query created to gather the record counts: select 'omop' source_schema_name, 'drug_exposure' source_table_name, 'drug_type_concept_id' variable_name, 1 variable_type, null variable_value 1 statistic_type, count(1) statistic_value, 'care_site_id' variable_description_level_1, b.care_site_id variable_value_level_1 from omop.drug_exposure a left outer join omop.visit_occurrence b on a.visit_occurrence_id = b.visit_occurrence_id group by b.care_site_id; Example – Categorical w/ Variable Formula The following rule would be used to gather Categorical statistics for the calculated visit_start year field in the visit_occurrence table grouped by care_site_id: Column Name source_schema_name source_table_name variable_name variable_type variable_formula variable_description_level_1 variable_group_by_level_1 Value omop visit_occurrence visit_start_year 2 – Categorical extract(year from (cast(visit_start_date as date))) care_site_id care_site_id And the following results would be generated for this rule: source_ schema_ name omop omop omop omop source_ table_ name visit_occurrence visit_occurrence visit_occurrence visit_occurrence variable_name visit_start_year visit_start_year visit_start_year visit_start_year variable_ type variable_ value 2 2 2 2 2011 2012 2011 2012 statistic_ type statistic_ value 1 1 1 1 139 141 140 144 variable_ variable_ description_ value_ level_1 level_1 care_site_id 1 care_site_id 2 care_site_id 1 care_site_id 2 Derived Statistics • Query against the OSCAR results to gather additional statistics for Categorical variable type: – Count (distinct) – query number of results for variable – Number of NULL – query statistic_value where variable_value is NULL – Number of empty string – query statistic_value where variable_value = ‘’ – Occurrence counts – query count of statistic_value grouped by statistic_value