Strategic Health IT Advanced Research Projects (SHARP) Area 4

Strategic Health IT Advanced Research Projects (SHARP)
Area 4: Secondary Use of EHR Data
Project 3: High-Throughput Phenotyping
June 30, 2011
Jyoti Pathak, PhD
Assistant Professor of Biomedical Informatics
Department of Health Sciences Research
Project 3: Collaborators and
Acknowledgments
 CDISC (Clinical Data Interchange Standards Consortium)
– Rebecca Kush, Landen Bain, Mark Arratoon
 Centerphase Solutions
– Gary Lubin, Jeff Tarlowe
 Harvard University/MIT
– Guergana Savova, Margarita Sordo, Peter Szolovits
 IBM T.J. Watson Research Labs
– Marshall Schor
 Intermountain Healthcare/University of Utah
– Susan Welch, Herman Post, Darin Wilcox, Peter Haug
 Mayo Clinic
– Cui Tao, Lacey Hart, Erin Martin, Sridhar Dwarkanath, Calvin
Beebe, Kent Bailey, Kevin Bruce, Mike Conway (UCSD)
Outline
 Background
 On-going projects and updates
 Proposed project ideas for Year 2
 Productivity till date
Q&A
The Big Question…
 The era of Genome-Wide Association Studies (GWAS) has
arrived
– Genotyping cost is asymptoting to free [Altman et al.]
– Most (all?) published GWAS are done on carefully
selected and uniformly characterized patient populations
– Time consuming
 Clinical Phenotyping, on the other hand, is lacking
– Slow-throughput
– Costly and time consuming
 How “good” are EMRs (with inconsistencies and biases) as a
source for phenotypes?
Why is this important now?
 Bio-repositories are becoming popular
– Linking biospecimens to personal health data
 Population-based studies for genetic and environmental
conditions and contributions to disease etiology
– Often limited in scope or population diversity
 Clinical trials eligibility
– Cohort identification is always a bottleneck
 Quality metrics and HITECH Act
 Large-scale prospective cohort studies could be facilitated by
availability of complete, standardized, and unbiased data from
EMRs
Pros and Cons of EMR Data for
Phenotyping
 We have a LOT of information about subjects
– Demographics, labs, meds, procedures…
– Team diagnoses as opposed to a diagnosis based on a
single person’s opinion
– Potential for more reliable diagnoses
– Identification of otherwise latent population differences
 Possible issues with using EMR data for phenotyping
– Non-standardized, heterogeneous, unstructured data
– Measured (e.g., demographics) vs. un-measured (e.g.,
socio-economic status) population differences
– Hospital specialization and coding practices
– Population/regional market landscape
But…the challenges can be
addressed…if we
 Develop techniques for standardization and normalization of
clinical data
 Develop techniques for transforming and managing
unstructured clinical text into structured representations
 Develop techniques for resolving missing and inconsistent
data
 Develop a scalable, robust and flexible framework for
demonstrating all of the above in a “real-world setting”
EMR-derived Phenotyping
 Overarching goal
– To develop techniques and algorithms that operate on
normalized EMR data to identify cohorts of potentially
eligible subjects on the basis of disease, symptoms, or
related findings
 Phenotyping (from our perspective)
– Inclusion and exclusion criteria for cohort identification
– Numerator and denominator criteria for clinical quality
metrics
– Trigger criteria for clinical decision support
– …
EMR-based Phenotype Algorithms
 Typical components
– Billing and diagnosis codes
– Procedure codes
– Labs
– Medications
– Phenotype-specific co-variates (e.g., Demographics,
Vitals, Smoking Status, CASI scores)
– Pathology
– Imaging?
 Organized into inclusion and exclusion criteria
 Experience from eMERGE (http://www.gwas.net)
– Electronic Medical Records and Genomics Network
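A minimal Python sketch of how such components might be organized into inclusion and exclusion criteria. The `Patient` fields, the code sets, and the medication name are hypothetical placeholders for illustration, not the eMERGE definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Patient:
    # Hypothetical, simplified record; real EMR data is far richer.
    patient_id: str
    diagnosis_codes: Set[str] = field(default_factory=set)
    medications: Set[str] = field(default_factory=set)


Criterion = Callable[[Patient], bool]


@dataclass
class PhenotypeAlgorithm:
    name: str
    inclusion: List[Criterion]
    exclusion: List[Criterion]

    def is_case(self, p: Patient) -> bool:
        # A case meets every inclusion criterion and no exclusion criterion.
        included = all(c(p) for c in self.inclusion)
        excluded = any(c(p) for c in self.exclusion)
        return included and not excluded


# Placeholder criteria (assumed for illustration only).
def has_t2dm_code(p: Patient) -> bool:
    return bool(p.diagnosis_codes & {"250.00", "250.02"})


def on_t2dm_med(p: Patient) -> bool:
    return "metformin" in p.medications


def has_t1dm_code(p: Patient) -> bool:
    return bool(p.diagnosis_codes & {"250.01", "250.03"})


t2dm = PhenotypeAlgorithm(
    "T2DM (illustrative)",
    inclusion=[has_t2dm_code, on_t2dm_med],
    exclusion=[has_t1dm_code],
)

cohort = [
    Patient("p1", {"250.00"}, {"metformin"}),
    Patient("p2", {"250.00", "250.01"}, {"metformin"}),
    Patient("p3", {"401.9"}, set()),
]
cases = [p.patient_id for p in cohort if t2dm.is_case(p)]
```

Keeping each criterion as a separate function makes it easier to refine one piece of the logic without touching the rest.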
EMR-based Phenotype Algorithms
 Iteratively refine case definitions through partial manual
review (to achieve ~PPV ≥ 95%)
 For controls, exclude all potentially overlapping syndromes
and possible matches; iteratively refine such that ~NPV ≥
98%
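The refinement targets can be checked after each round of chart review with a small calculation. This is a sketch only: the function name, the input layout (one boolean per reviewed chart, True when the reviewer agreed with the algorithm's label), and the sample counts are assumptions.

```python
def review_summary(case_reviews, control_reviews, ppv_target=0.95, npv_target=0.98):
    """Summarize a partial manual review of algorithm-labelled cases and controls."""
    true_pos = sum(case_reviews)
    false_pos = len(case_reviews) - true_pos
    true_neg = sum(control_reviews)
    false_neg = len(control_reviews) - true_neg

    ppv = true_pos / (true_pos + false_pos)  # confirmed cases / flagged cases
    npv = true_neg / (true_neg + false_neg)  # confirmed controls / flagged controls
    return {
        "ppv": ppv,
        "npv": npv,
        "refine_cases": ppv < ppv_target,
        "refine_controls": npv < npv_target,
    }


# Hypothetical review round: 50 case charts and 100 control charts.
summary = review_summary([True] * 48 + [False] * 2, [True] * 97 + [False] * 3)
```

In this made-up round the case definition meets its target while the control definition would go through another iteration.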
Example: Type 2 Diabetes (cases)
Challenges
 Algorithm design
– Non-trivial; requires significant expert involvement
– Highly iterative process
– Time-consuming manual chart reviews
– Representation of “phenotypic logic”
 Data access and representation
– Lack of unified vocabularies, data elements, and value
sets
– Questionable reliability of ICD & CPT codes (e.g., omitting
codes that don’t pay well, billing the wrong code because it is
easier to find)
– Natural Language Processing needs
 And many more…
Outline
 Background
 On-going projects and updates
 Proposed projects for Year 2
 Productivity till date
Q&A
Current HTP Project Themes
Identification of Clinical Element Models
Phenotyping Execution Logic
Data Quality, Validation and Cost Effectiveness
Project Overview
 Three eMERGE phenotyping algorithms as initial Use Cases
– Type 2 Diabetes Mellitus (T2DM)
– Peripheral Arterial Disease (PAD)
– Hypothyroidism
 Specified computable mappings between CEMs and algorithms
 Classified phenotyping input specifications into two categories:
– General EHR data requirements (Examples: demographics,
diagnoses)
– Phenotype-specific EHR data (Example: Ankle-brachial index
for PAD)
 Proposed semantic types of the input specifications
Semantic Classification Types
 Demographic data (e.g., Gender, Race, Age, etc.)
 Physical measurements (e.g., Weight, Height, BMI, etc.)
 Diagnosis (ICD codes, SNOMED CT annotations from
problem list, administrative coding workflows, clinical
notes, etc.)
 Procedure (CPT codes, ICD procedure codes)
 Medication
 Laboratory
General Models for Scalability
 Diagnosis
– AdministrativeDiagnosisCode: billing purposes
– ClinicalAssertedDiagnosisCode: problem list, clinical notes, etc
 Medication
– Prescribed/Ordered
– Dispensed
– Administered
 Procedure
– AdministrativeProcedureCode: CPT code, ICD 9 code for
inpatient.
 Laboratory
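One way to picture these general models is as a small type hierarchy. The sketch below borrows the names AdministrativeDiagnosisCode and ClinicalAssertedDiagnosisCode from the slide; the fields, the `MedicationStatus` enum, and the sample values are assumptions and do not reproduce the actual CEM definitions.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class MedicationStatus(Enum):
    PRESCRIBED_ORDERED = "prescribed/ordered"
    DISPENSED = "dispensed"
    ADMINISTERED = "administered"


@dataclass(frozen=True)
class DiagnosisCode:
    patient_id: str
    code: str
    coding_system: str
    recorded_on: date


class AdministrativeDiagnosisCode(DiagnosisCode):
    """Diagnosis recorded for billing purposes."""


class ClinicalAssertedDiagnosisCode(DiagnosisCode):
    """Diagnosis asserted clinically (problem list, clinical notes, etc.)."""


@dataclass(frozen=True)
class Medication:
    patient_id: str
    name: str
    status: MedicationStatus
    recorded_on: date


# Illustrative facts for one hypothetical patient.
facts = [
    AdministrativeDiagnosisCode("p1", "250.00", "ICD-9-CM", date(2010, 3, 1)),
    ClinicalAssertedDiagnosisCode("p1", "44054006", "SNOMED CT", date(2010, 3, 1)),
]
billing_only = [f for f in facts if isinstance(f, AdministrativeDiagnosisCode)]
```

Separating the two diagnosis types lets an algorithm state explicitly whether billing codes alone are acceptable evidence.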
Mapping Issues
 Secondary use versus patient care meanings
– History of X meaning “evidence of X prior to date Y”
versus history of X statement in text documents
– Diagnosis inputs often validated on ICD-9-CM codes
 Non-standard aggregations
– Fasting glucose test
 Availability of data in EHR
– Age at onset of X
– Medical specialty (ankle brachial index)
– Smoking history/family history (NLP/structured
solutions)
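The secondary-use reading of "history of X" can be sketched as a simple temporal check. The function name and the (concept, date) tuple layout are assumptions; "history of X" statements in free text would need NLP and are not covered here.

```python
from datetime import date


def has_history_of(observations, concept, as_of):
    """Secondary-use reading: any evidence of `concept` dated strictly before `as_of`."""
    return any(c == concept and d < as_of for c, d in observations)


# Hypothetical structured evidence for one patient: (concept, date observed).
observations = [
    ("hypertension", date(2008, 6, 1)),
    ("diabetes", date(2010, 2, 15)),
]

before_2009 = has_history_of(observations, "diabetes", date(2009, 1, 1))
before_2011 = has_history_of(observations, "diabetes", date(2011, 1, 1))
```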
Mapping Considerations
 Algorithm inputs are abstractions of EHR content
– Native content
– Generalized content
– Computed
– Selected content
 Common constraints of EHR content
– Source of data, i.e., EHR application used, encounter type
– Allowable codes
– Temporal bounds
– Relationships among separate observations
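Three of these constraints (source of data, allowable codes, temporal bounds) can be sketched as a filter over observations. All class names, field names, and codes below are hypothetical; relationships among separate observations are not modelled.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Observation:
    code: str
    observed_on: date
    source: str  # hypothetical label for EHR application or encounter type


@dataclass(frozen=True)
class Constraint:
    # None means "no restriction" on that dimension.
    allowable_codes: Optional[FrozenSet[str]] = None
    allowed_sources: Optional[FrozenSet[str]] = None
    start: Optional[date] = None
    end: Optional[date] = None

    def accepts(self, obs: Observation) -> bool:
        if self.allowable_codes is not None and obs.code not in self.allowable_codes:
            return False
        if self.allowed_sources is not None and obs.source not in self.allowed_sources:
            return False
        if self.start is not None and obs.observed_on < self.start:
            return False
        if self.end is not None and obs.observed_on > self.end:
            return False
        return True


fasting_glucose = Constraint(
    allowable_codes=frozenset({"GLU-FAST"}),  # placeholder, not a real lab code
    start=date(2009, 1, 1),
    end=date(2010, 12, 31),
)
observations = [
    Observation("GLU-FAST", date(2009, 5, 2), "outpatient"),
    Observation("GLU-FAST", date(2011, 1, 15), "outpatient"),
    Observation("GLU-RANDOM", date(2009, 5, 2), "inpatient"),
]
selected = [o for o in observations if fasting_glucose.accepts(o)]
```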
Example CEM to Algorithm Map
Content Type: Native; Generalized; Computed
Algorithm Input: Gender; Non-Statin Lipid Lowering Drugs; Fasting Blood Glucose; Body Mass Index
CEM: AdministrativeGender; OrderMedAmb; StandardLabObsQuantitative; BodyWeightMeas
Abstraction and Constraint entries: Normalized medication terminology chemical therapeutic class; Allowable normalized lab observation codes; function (weight, height); Date Range of Interest; Date Range of Interest
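The "Computed" content type can be illustrated with Body Mass Index as a function of weight and height. This sketch assumes weight in kilograms and height in metres; unit normalization of the underlying measurements is taken as already done.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI as a computed input: weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m ** 2)


# Hypothetical measurements for one patient.
bmi = body_mass_index(70.0, 1.75)
```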
Example CEM to Algorithm Map - 2
Content Type: Selected
Algorithm Input: Body Mass Index; Blood Pressure
CEM: BodyMassIndexMeas; BloodPressureMeas
Abstraction: MAXIMUM
Constraint (Body Mass Index): 1. Date Range of Interest; 2. Date does not co-occur with Pregnancy Diagnosis Time Range
Constraint (Blood Pressure): Date occurs 1 month after LAST (ANTIHYPERTENSIVE CLASS (Medication Assertion))
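A sketch of the MAXIMUM abstraction for Body Mass Index under its two constraints (date range of interest, no co-occurrence with a pregnancy diagnosis time range). The function name and the data layout are assumptions, and the blood pressure rule is not covered.

```python
from datetime import date


def max_bmi(bmi_measurements, date_range, pregnancy_ranges):
    """Largest BMI measured inside date_range and outside every pregnancy range.

    bmi_measurements: list of (measured_on, value) tuples.
    date_range, pregnancy_ranges: inclusive (start, end) date pairs.
    """
    start, end = date_range
    eligible = [
        value
        for measured_on, value in bmi_measurements
        if start <= measured_on <= end
        and not any(ps <= measured_on <= pe for ps, pe in pregnancy_ranges)
    ]
    return max(eligible) if eligible else None


# Hypothetical measurements for one patient.
measurements = [
    (date(2009, 3, 1), 31.0),
    (date(2009, 9, 1), 36.5),   # falls inside the pregnancy range
    (date(2010, 6, 1), 32.2),
    (date(2011, 2, 1), 40.0),   # falls outside the date range of interest
]
selected_bmi = max_bmi(
    measurements,
    (date(2009, 1, 1), date(2010, 12, 31)),
    [(date(2009, 6, 1), date(2010, 2, 28))],
)
```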
Current HTP Project Themes
Identification of Clinical Element Models
Phenotyping Execution Logic
Data Quality, Validation and Cost Effectiveness
Drools-based Phenotyping Architecture
Clinical Element Database → Drools (along with other technologies) → List of Patients for Specific Cases
 Workflow authoring by domain experts (clinicians)
 Rule accessibility by clinicians – BPMN, decision tables, DSL;
collaborative authoring
Domain Expert ~ Analyst ~ Developer
Drools-based Phenotyping Architecture
Components, in diagram order:
– Clinical Element Database
– Data Access Layer
– Business Logic
– Transformation Layer: transform physical representation → normalized logical representation (Fact Model)
– Inference Engine (Drools)
– Service for Creating Output (File, Database, etc.)
– List of Diabetic Patients
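The layering can be sketched end to end in a few lines. This is a plain-Python stand-in, not Drools: the row layout, the fact class, the code set, and the two-visit rule are all assumptions chosen to show how each layer hands data to the next.

```python
from collections import Counter
from dataclasses import dataclass


def fetch_rows():
    # Data access layer: hypothetical rows in a storage-oriented layout,
    # one row per coded visit.
    return [
        {"pid": "p1", "dx_cd": "250.00", "enc_type": "OUTPT"},
        {"pid": "p1", "dx_cd": "250.00", "enc_type": "OUTPT"},
        {"pid": "p2", "dx_cd": "401.9", "enc_type": "OUTPT"},
        {"pid": "p3", "dx_cd": "250.00", "enc_type": "INPT"},
    ]


@dataclass(frozen=True)
class DiagnosisFact:
    # Normalized logical representation (fact model).
    patient_id: str
    code: str
    outpatient: bool


def to_facts(rows):
    # Transformation layer: physical representation -> fact model.
    return [DiagnosisFact(r["pid"], r["dx_cd"], r["enc_type"] == "OUTPT") for r in rows]


def infer_diabetic(facts, dm_codes=frozenset({"250.00", "250.02"}), min_visits=2):
    # Stand-in for the inference engine: one hard-coded rule.
    counts = Counter(f.patient_id for f in facts if f.outpatient and f.code in dm_codes)
    return sorted(pid for pid, n in counts.items() if n >= min_visits)


def render_output(patient_ids):
    # Output service: here just a newline-separated listing.
    return "\n".join(patient_ids)


diabetic_patients = infer_diabetic(to_facts(fetch_rows()))
report = render_output(diabetic_patients)
```

Because the rule only sees facts, the storage layout can change without touching the phenotyping logic.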
Drools – Workflow
Diabetes Project Status
 Diabetes Rules are Completed
 Demonstrated the Workflow/Rules for Feedback
 Make Rules “Shareable”
 Performance Validation
 More details in the later session!
DM2 algorithm: logic statements with GELLO and QDM expressions (NQF QDM Criteria)
Logic Statement 1: Patient record flagged as “Y” with research authorization (nothing in data model to represent this)
GELLO expression:
context Patient
def: researchAuthorization: Boolean = Exist(Self.explicitConsent = ‘Y’)
If researchAuthorization
QDM expression:
If Patient.explicitConsent = ‘Y’
Logic Statement 2: Patient age greater than 18 at the start of measurement period [1/1/09-12/31/10]
GELLO expression:
context Patient
def: age: Integer =
let startOfMeasurement = PointInTime : 1/1/09
in startOfMeasurement – Self.birthdate
If age > 18
QDM expression:
startOfMeasurement = 1/1/09
If startOfMeasurement – Patient.birthdate > 18
Logic Statement 3: Patient meets at least one of the following criteria: Patient has at least 2 clinic (face-to-face outpatient) visits during measurement period with visits coded with a diabetes ICD-9-CM code OR
GELLO expression:
context Patient
def: face2face: Integer =
let startOfMeasurement = PointInTime : 1/1/09,
let endOfMeasurement = PointInTime : 12/31/10,
let dmCodes = {listOfICD9CodesForDM},
let EncountersWithDMcodes:
QDM expression:
startOfMeasurement = 1/1/09
endOfMeasurement = 12/31/10
dmCodes = {listOfICD9CodesForDM}
Countdistinct(Encounter: encounter outpatient DURING startOfMeasurement and endOfMeasurement and Encounter.ClinicalEncounterId in dmCodes) >= 2
NQF QDM Criteria
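For comparison, the same three logic statements can be sketched in Python. This is neither GELLO nor QDM: the dictionary layout and the `DM_CODES` set (standing in for listOfICD9CodesForDM) are assumptions, and only the first alternative of the "at least one of the following" criterion is covered.

```python
from datetime import date

START_OF_MEASUREMENT = date(2009, 1, 1)
END_OF_MEASUREMENT = date(2010, 12, 31)
DM_CODES = {"250.00", "250.02"}  # placeholder for listOfICD9CodesForDM


def age_on(birthdate, on):
    """Age in whole years on a given date."""
    before_birthday = (on.month, on.day) < (birthdate.month, birthdate.day)
    return on.year - birthdate.year - int(before_birthday)


def meets_criteria(patient):
    # 1. Research authorization flagged as "Y".
    if patient["explicit_consent"] != "Y":
        return False
    # 2. Age greater than 18 at the start of the measurement period.
    if age_on(patient["birthdate"], START_OF_MEASUREMENT) <= 18:
        return False
    # 3. At least 2 distinct outpatient encounters with a diabetes code in the period.
    dm_visits = {
        e["encounter_id"]
        for e in patient["encounters"]
        if e["outpatient"]
        and START_OF_MEASUREMENT <= e["date"] <= END_OF_MEASUREMENT
        and e["icd9"] in DM_CODES
    }
    return len(dm_visits) >= 2


# Hypothetical patients.
visits = [
    {"encounter_id": "e1", "outpatient": True, "date": date(2009, 3, 1), "icd9": "250.00"},
    {"encounter_id": "e2", "outpatient": True, "date": date(2010, 4, 1), "icd9": "250.02"},
]
patients = [
    {"id": "p1", "explicit_consent": "Y", "birthdate": date(1960, 5, 1), "encounters": visits},
    {"id": "p2", "explicit_consent": "N", "birthdate": date(1960, 5, 1), "encounters": visits},
    {"id": "p3", "explicit_consent": "Y", "birthdate": date(1995, 1, 1), "encounters": visits},
]
eligible = [p["id"] for p in patients if meets_criteria(p)]
```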
Current HTP Project Themes
Identification of Clinical Element Models
Phenotyping Execution Logic
Data Quality, Validation and Cost Effectiveness
Data Quality: Objectives
 Assess Data variability within and across
institutions
 Assess impact of this variability on Secondary
Use of EMR
 Generate specifications for Widgets
– “Warning Label” for suspect data categories
– Data quality audits with logs
– Batch data correction / removal
More details during the later session!
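A "warning label" widget could be as simple as flagging data categories with too many missing values. This is a sketch under stated assumptions: treating missingness as the marker of suspect data, the 20% threshold, and the record layout are all hypothetical choices, not the project's specification.

```python
def warning_labels(records, categories, max_missing_fraction=0.2):
    """Return a label for each category whose missing fraction exceeds the threshold."""
    labels = {}
    for category in categories:
        missing = sum(1 for r in records if r.get(category) in (None, ""))
        fraction = missing / len(records) if records else 0.0
        if fraction > max_missing_fraction:
            labels[category] = f"WARNING: {fraction:.0%} missing"
    return labels


# Hypothetical records pulled from one institution.
records = [
    {"gender": "F", "smoking_status": "never"},
    {"gender": "M", "smoking_status": None},
    {"gender": "F", "smoking_status": ""},
    {"gender": "M", "smoking_status": "current"},
]
labels = warning_labels(records, ["gender", "smoking_status"])
```

Running the same check per institution would give a rough view of variability across sites.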
Centerphase Project
Research Design
Randomly generate ONE sample set of patient records from database, based on T2DM ICD9 codes from at least 2 visits during measurement period
Sample Patient Records feed two parallel processes:
– Manual Process: Study coordinator (SC) conducts manual review of patient charts, and monitors activity time → Screens 1-3 → Patient Result Set
– Algorithm-Driven Process: Programmer develops and runs algorithm to query records, and monitors development and run time → Screens 1-3 → Patient Result Set
Compare time, cost and accuracy of results
Outline
 Background
 On-going projects and updates
 Proposed projects for Year 2
 Productivity till date
Q&A
Project 1: National Library for Clinical
Phenotyping Algorithms
 Current state of the art
– MS Word files: do not scale
– An FTP server: will not work either
– We need…programmatic access, querying, navigation
– Promote re-use (where applicable)
 Research Question: To develop an implementation
independent, phenotyping logic representation template for
algorithm design
– Existing work on Drools, GELLO and NQF
– Leverage CEMs for algorithm design and representation
– Publicly accessible Web-based environment for
phenotyping algorithms
– Validate algorithm deployment in multiple EMR settings
Project 2: Machine Learning and
Phenotyping
 EMR-derived phenotyping algorithm development is tedious,
and time-consuming
– Based on our own experience!
 Research Question: To leverage machine learning methods
for rule/algorithm development, and validate against expert
developed ones
– Use eMERGE library of phenotype algorithms for
validation
– Asthma and Diabetes as initial use-cases
 Preliminary work by Susan
– Work with data normalization and NLP teams
Project 3: Just-in-Time Phenotyping
 The current pipeline prototype is based on a relational
persistence layer
– Access to historical, retrospective data
– Offline processing of data and phenotyping algorithms
 Research Question: To apply phenotyping algorithms as
“data sniffers” that can be plugged within a UIMA pipeline
– Online, real-time phenotyping (e.g., for clinical decision
support)
– How much data is “necessary”? How much data is
“necessary and sufficient”?
– More active role of NLP techniques
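The "data sniffer" idea can be sketched as a generator that watches a stream of events and flags a patient as soon as enough evidence has arrived. This is a plain-Python stand-in, not a UIMA component; the event layout, the code set, and the two-encounter rule are assumptions.

```python
from collections import defaultdict


def sniff_diabetes(events, dm_codes, min_encounters=2):
    """Yield (patient_id, encounter_id) when a patient first reaches the threshold."""
    seen = defaultdict(set)   # patient_id -> distinct diabetes-coded encounter ids
    flagged = set()
    for event in events:
        pid = event["patient_id"]
        if pid in flagged or event["icd9"] not in dm_codes:
            continue
        seen[pid].add(event["encounter_id"])
        if len(seen[pid]) >= min_encounters:
            flagged.add(pid)
            yield pid, event["encounter_id"]


# Hypothetical event stream, in arrival order.
stream = [
    {"patient_id": "p1", "encounter_id": "e1", "icd9": "250.00"},
    {"patient_id": "p2", "encounter_id": "e2", "icd9": "401.9"},
    {"patient_id": "p1", "encounter_id": "e1", "icd9": "250.00"},  # duplicate of e1
    {"patient_id": "p1", "encounter_id": "e3", "icd9": "250.02"},
    {"patient_id": "p1", "encounter_id": "e4", "icd9": "250.00"},
]
alerts = list(sniff_diabetes(stream, {"250.00", "250.02"}))
```

The sniffer stops examining a patient once flagged, which is one simple answer to how much data is "sufficient" for this particular rule.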
Project 4: Phenotyping Workbench
 EMR-based phenotyping algorithms are hard to design, and
even harder to implement
– Access to domain experts—often a resource issue
– Access to IT/informatics experts—also, a resource issue
– Lots of moving components
 Research Question: To develop a phenotyping “plug & play”
workbench for algorithm design and evaluation
– Visual and graphical algorithm editing (jPBMN)
– Configurable algorithms (Drools code snippets)
– User workspace management (who are these “users”?)
– File-based or database access layer (CEM-based)
– Leverage i2b2 workbench where applicable
– “Plug & Play” is still a big challenge…
Outline
 Background
 On-going projects and updates
 Proposed projects for Year 2
 Productivity till date
Q&A
Productivity till date
 Manuscripts/Abstracts/Posters
– Conway MA, Berg RL, Carrell D, Denny JC, Kho AN, Kullo IJ, Linneman JG,
Pacheco JA, Peissig PL, Rasmussen L, Weston N, Chute CG, Pathak J.
Analyzing Heterogeneity and Complexity of Electronic Health Record
Oriented Phenotyping Algorithms. AMIA 2011 (paper).
– Tao C, Parker CG, Oniki TA, Pathak J, Huff SM, Chute CG. An OWL
Meta-Ontology for Representing the Clinical Element Model. AMIA 2011 (paper).
– Chute CG, Pathak J, Savova GK, Bailey KR, Schor MI, Hart LA, Beebe CE,
Huff SM. The SHARPn Project on Secondary Use of Electronic Medical
Record Data: Progress, Plans and Possibilities. AMIA 2011 (paper).
– Conway MA, Pathak J. Analyzing the Prevalence of Hedges in Electronic
Health Record Oriented Phenotyping Algorithms. AMIA 2011 (poster).
– Tao C, Welch SR, Wei WQ, Oniki TA, Parker CG, Pathak J, Huff SM, Chute
CG. Normalized Representation of Data Elements for Phenotype Cohort
Identification in Electronic Health Record. AMIA 2011 (poster).
 Prototype software
– Drools-based implementation of the diabetes algorithm
Thank You!