Rheumatoid Arthritis, an i2b2 Driving Biology Project

i2b2 Rheumatoid Arthritis DBP
Defining RA in the electronic health
record for future studies
Elizabeth Karlson, MD
Associate Professor of Medicine
Harvard Medical School
Brigham and Women’s Hospital
Background: Partners Resources
• i2b2: “Informatics for Integrating Biology and the
Bedside”
• RPDR: “Research Patient Data Repository”
• Natural Language Processing (HiTEX)
• Gold standard dataset:
– Training set: 500 manual chart reviews
– Validation set: 400 manual chart reviews
Coded data
• ICD-9 codes for RA
• ICD-9 codes for related phenotypes
– Lupus (SLE), psoriatic arthritis (PsA), juvenile
idiopathic arthritis (JIA)
• Lab results for RA related antibodies
– Rheumatoid factor (RF), anti-CCP
• Medications
– physician entry, escripts
NLP Concepts
NLP queries
– Rheumatoid arthritis
– RA-related antibodies
• Anti-CCP/RF/seropositive
• Result coded as positive/negative
– RA Medications
• Coded as any mention
– Radiographs: RA erosions
• Coded as any erosion
Approach to develop RA cohort
[Flowchart]
• RA Mart, n=29,432
– ≥ 1 ICD RA, n=25,830
OR
– Anti-CCP, n=3,602
• Training set, n=500 (training of the classification algorithm)
• Validation set, n=400
• Predicted RA cases, n=3,585
• Arrow labels: ↑ Sensitivity, ↑ Specificity
Classification algorithm
Step 1: Develop gold standard training set
Step 2: Identify variables important for predicting RA
Step 3: Develop algorithm
Chart review results
• RA Mart, N=32,000
– ICD9 = 714.xxx
OR
– CCP test ordered
• Manual chart review for 500 patients
– 20% validation rate (100 of 500 confirmed as definite RA)
– Definite RA = 100
– Possible/no RA = 400
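The screening rule for the RA Mart and the chart-review yield above can be sketched in a few lines of Python. This is only an illustration: the field names `icd9_codes` and `ccp_ordered` and the three example patients are hypothetical, not the RPDR schema or real data.

```python
def in_ra_mart(icd9_codes, ccp_ordered):
    """Screening rule from the slide: any ICD-9 714.x code OR an anti-CCP test ordered."""
    has_ra_code = any(code.startswith("714") for code in icd9_codes)
    return has_ra_code or ccp_ordered


def validation_rate(definite_cases, charts_reviewed):
    """Fraction of reviewed charts confirmed as definite RA."""
    return definite_cases / charts_reviewed


# Hypothetical patients: A has an RA code, B only a CCP order, C neither.
patients = [
    {"id": "A", "icd9_codes": ["714.0"], "ccp_ordered": False},
    {"id": "B", "icd9_codes": ["710.0"], "ccp_ordered": True},
    {"id": "C", "icd9_codes": ["696.0"], "ccp_ordered": False},
]
mart = [p["id"] for p in patients if in_ra_mart(p["icd9_codes"], p["ccp_ordered"])]
print(mart)                        # ['A', 'B']
print(validation_rate(100, 500))   # 0.2
```

A broad rule like this favors sensitivity, which is consistent with only about one in five reviewed charts being confirmed as definite RA.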
Comparison of NLP to manual
chart review
• Precision of NLP queries
– Methotrexate: 100%
– Etanercept: 100%
– CCP+: 98.7%
– Seropositive: 96%
– Erosion: 88%
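Precision here means the share of NLP hits that manual chart review confirmed. A minimal sketch of that calculation follows; the note IDs are made up for illustration and do not come from the study.

```python
def precision(nlp_hits, chart_confirmed):
    """Precision = NLP hits confirmed by chart review / all NLP hits."""
    hits = set(nlp_hits)
    if not hits:
        raise ValueError("no NLP hits to evaluate")
    true_positives = hits & set(chart_confirmed)
    return len(true_positives) / len(hits)


# Hypothetical note IDs: NLP flagged 8 notes for erosion, reviewers confirmed 7 of them.
nlp_erosion = {"n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"}
chart_erosion = {"n1", "n2", "n3", "n4", "n5", "n6", "n7", "n9"}
print(f"{precision(nlp_erosion, chart_erosion):.1%}")  # 87.5%
```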
Classification algorithm
Step 2: Define variables
(Vivian Gainer, Sergey Goryachev, Qing Zeng-Treitler, Shawn Murphy)
• Codified data
– ICD9 billing codes
– Electronic medication prescription
– CCP, RF lab results
• Narrative data extracted using natural language processing (NLP), i.e.
from physician notes, radiology reports
– Erosions
– RF positive, CCP positive, seropositive
– RA medications
Classification algorithm
Step 3: Develop algorithm
(Tianxi Cai)
• Penalized logistic regression with adaptive LASSO
• Parsimonious predictors selected based on BIC
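For orientation, the textbook form of an adaptive-LASSO-penalized logistic regression and of the BIC used to choose the penalty can be written as below. The slide does not give the exact specification, so the initial estimate \tilde{\beta}, the exponent \gamma, and the degrees-of-freedom definition are assumptions based on the standard formulation, not details confirmed by the source.

```latex
\hat{\beta}(\lambda) = \arg\min_{\beta_0,\,\beta}
  \Bigl\{ -\sum_{i=1}^{n} \bigl[ y_i \eta_i - \log\bigl(1 + e^{\eta_i}\bigr) \bigr]
  + \lambda \sum_{j=1}^{p} \frac{|\beta_j|}{|\tilde{\beta}_j|^{\gamma}} \Bigr\},
  \qquad \eta_i = \beta_0 + x_i^{\top}\beta

\mathrm{BIC}(\lambda) = -2\,\ell\bigl(\hat{\beta}(\lambda)\bigr) + \log(n)\,\mathrm{df}(\lambda)
```

Here y_i is the chart-review label (RA or not), x_i the codified and NLP variables, \tilde{\beta} an initial coefficient estimate that sets the per-variable weights, \ell the logistic log-likelihood, and df(\lambda) is commonly taken as the number of nonzero coefficients. Picking the \lambda with the smallest BIC yields the parsimonious predictor set.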
Model                                         RA (n)   PPV (%)   Sensitivity (%)   Difference in PPV
Algorithms
  Narrative + Codified                        3585     94        63                reference
  Codified only                               3046     88        51                6
  NLP only                                    3341     89        56                5
Published administrative codified criteria
  ≥ 3 ICD9 RA                                 7960     56        80                38
  ≥ 1 ICD9 RA + med                           7799     45        66                49
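The "Difference in PPV" column is each model's PPV subtracted from the reference model's PPV. The sketch below reproduces that arithmetic from the figures listed above; the false-positive estimate is an added illustration derived from n and PPV, not a number reported on the slide.

```python
# (model, patients classified as RA, PPV %, sensitivity %) as listed above
models = [
    ("Narrative + Codified", 3585, 94, 63),
    ("Codified only", 3046, 88, 51),
    ("NLP only", 3341, 89, 56),
    (">=3 ICD9 RA", 7960, 56, 80),
    (">=1 ICD9 RA + med", 7799, 45, 66),
]
reference_ppv = models[0][2]


def ppv_difference(ppv, reference=reference_ppv):
    """Percentage points of PPV lost relative to the reference model."""
    return reference - ppv


def expected_false_positives(n_classified, ppv_percent):
    """Rough count of classified patients who would not be true RA at this PPV."""
    return round(n_classified * (100 - ppv_percent) / 100)


for name, n, ppv, sens in models:
    print(f"{name}: PPV difference {ppv_difference(ppv)}, "
          f"about {expected_false_positives(n, ppv)} false positives")
```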
Top 5 predictive variables for RA
Variable                     Standardized regression coefficient   Standard error
NLP rheumatoid arthritis     1.11                                  0.48
NLP seropositive             0.74                                  0.26
ICD9 RA normalized           0.71                                  0.23
ICD9 RA                      0.66                                  0.44
NLP erosions                 0.46                                  0.29
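To show how standardized coefficients like these combine in a logistic model, here is a small sketch. It is not the published algorithm: it uses only the five variables listed above, the variable keys are hypothetical names, and the intercept is a placeholder because the slide does not report one.

```python
import math

# Standardized coefficients for the five variables listed above.
COEFFICIENTS = {
    "nlp_rheumatoid_arthritis": 1.11,
    "nlp_seropositive": 0.74,
    "icd9_ra_normalized": 0.71,
    "icd9_ra": 0.66,
    "nlp_erosions": 0.46,
}


def linear_predictor(z_scores, intercept=0.0):
    """Sum of coefficient * standardized value; the intercept is an assumed placeholder."""
    return intercept + sum(
        coef * z_scores.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )


def probability(z_scores, intercept=0.0):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(z_scores, intercept)))


# A patient one standard deviation above the mean on every listed variable.
one_sd_above = {name: 1.0 for name in COEFFICIENTS}
print(round(linear_predictor(one_sd_above), 2))  # 3.68
```

Because the coefficients are standardized, their relative size indicates which variables move the prediction most per standard deviation.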
i2b2 RA cohort
Characteristics      i2b2 RA, n=3,585    CORRONA*, n=7,971
Age, mean (SD)       57.5 (17.5)         58.9 (13.4)
Women (%)            79.9                74.5
Anti-CCP+ (%)        63                  N/A
RF+ (%)              74.4                72.1
Erosions (%)         59.2                52.8
MTX use (%)          59.5                52.8
TNFi use (%)         32.6                22.6
*Consortium of Rheumatology Researchers of North America
Liao, et al., Arthritis Care & Research 2010
i2b2 Virtual RA Cohort Studies
• Case-control cohort
– ~4,000 RA cases
– ~13,000 matched non-RA controls
• Matched on age, gender, race and health care utilization
• Samples collected from 1500 cases/1500
controls for genotyping
– Genetic risk score predicts RA with same magnitude
as in GWAS (Kurreeman, 2010)
– CAD outcomes in RA cases being validated in i2b2
• Pharmacogenetics Research Network (PGRN)
i2b2 RA Project:
• Selected codified
data from RPDR
• Performed NLP
queries for RA
features
• Developed algorithm
based on:
coded + NLP data
Liao, 2010
PGRN Methods:
• Select codified data
from RPDR (meds)
• Perform NLP queries
for RA disease activity
features
• Develop algorithm(s)
based on:
Meds + NLP data
PGRN Specific Aims
• Aim 1: Define RA disease activity level in the
EMR
• Aim 2: Develop an algorithm to predict RA
disease activity from EMR data
• Aim 3: Define temporal relations between RA
medications and disease activity to define
treatment response in RA
Background
• In RA, disease activity score (DAS28) is
considered the gold standard tool to
evaluate disease activity and response to
treatment in clinical practice
• DAS28 has 2 components:
– Disease activity level
– Change in disease activity level
• Disease activity level scored as low, moderate, high
• Disease activity change scored as low, moderate, high
Van Gestel AM et al. Arthritis Rheum 1996; 39: 34-40
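The slide does not spell out how DAS28 is computed. As a sketch, the commonly published ESR-based formula and the usual cutoffs for the activity levels look like this in Python; the function and parameter names are illustrative choices.

```python
import math


def das28_esr(tender28, swollen28, esr_mm_hr, global_health_0_100):
    """DAS28 (ESR version): 28-joint tender and swollen counts, ESR in mm/h,
    and patient global health on a 0-100 visual analogue scale."""
    if esr_mm_hr <= 0:
        raise ValueError("ESR must be positive")
    return (
        0.56 * math.sqrt(tender28)
        + 0.28 * math.sqrt(swollen28)
        + 0.70 * math.log(esr_mm_hr)
        + 0.014 * global_health_0_100
    )


def activity_level(das28):
    """Map a DAS28 score to the usual categories (cutoffs 2.6, 3.2, 5.1)."""
    if das28 < 2.6:
        return "remission"
    if das28 <= 3.2:
        return "low"
    if das28 <= 5.1:
        return "moderate"
    return "high"


score = das28_esr(tender28=4, swollen28=2, esr_mm_hr=20, global_health_0_100=50)
print(round(score, 2), activity_level(score))  # 4.31 moderate
```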
Research Methods
• Construct a virtual cohort of RA patients (N=5906)
• Review charts for disease activity (document level)
– Remission
– Low
– Moderate
– High
– Indeterminate
– Remission/Low vs. High/Moderate
• Annotate charts for disease activity features (Knowtator)
– Disease_disorder
– Symptoms (reported pain, stiffness, swelling)
– Signs (objective tenderness, limited range of motion, synovitis)
– Anatomic site (relations with signs and symptoms)
– RA medication signature
– RA labs, level of inflammation (CRP, ESR)
– Patient functioning (activities of daily living)
NLP Methods
• Move from keyword matching in i2b2 to ontology
mapping in PGRN
• Customize cTAKES for
– RA medications
– RA anatomic sites
• Find relations between entities
• Define new modules
– RA medication changes (start/stop)
– Reasons to stop medications
– Lab values
– Patient functioning status
NLP Analytic Approaches
1- Internal gold standard datasets
– N=200 BWH annotated notes
– N= 200 MGH annotated notes
2- Analyses
– Study whether MD summary (1-3 sentences) predicts disease
activity
– SVM: construct vectors based on features and relations to
predict disease activity
– Bag of concepts to predict disease activity
3- External gold standard datasets:
– DAS28 scores from standardized tool at MGH matched to
clinical note
– DAS28 scores from BRASS matched to clinical note
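The bag-of-concepts idea can be sketched as follows: each note becomes a vector of concept counts, and the document-level label is collapsed to Remission/Low vs. High/Moderate. The concept names and the two example notes are hypothetical; in practice the concepts would come from the NLP pipeline, and vectors like these would feed a classifier such as an SVM.

```python
from collections import Counter


def bag_of_concepts(concepts, vocabulary):
    """Count how often each vocabulary concept appears in one note."""
    counts = Counter(concepts)
    return [counts[term] for term in vocabulary]


def binary_label(level):
    """Collapse document-level labels: Remission/Low -> 0, Moderate/High -> 1.
    Indeterminate notes raise an error here; excluding them is an assumption."""
    normalized = level.strip().lower()
    if normalized in ("remission", "low"):
        return 0
    if normalized in ("moderate", "high"):
        return 1
    raise ValueError(f"no binary label for {level!r}")


# Hypothetical annotated notes: (extracted concepts, reviewer label)
notes = [
    (["synovitis", "swelling", "swelling", "elevated_crp"], "High"),
    (["no_synovitis", "normal_crp"], "Remission"),
]
vocabulary = sorted({c for concepts, _ in notes for c in concepts})
X = [bag_of_concepts(concepts, vocabulary) for concepts, _ in notes]
y = [binary_label(level) for _, level in notes]
print(vocabulary)
print(X, y)
```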
Future work
• Define temporal relations between anti-TNF medication use (e.g. new starts) and
pre- and post-start disease activity to define
response to therapy
– Construct disease activity timeline (patient
level)
– Construct medication timeline (patient level)
• Use NLP to define temporal sequence of medication
start and adverse event
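One way to picture the planned timelines: given a patient's dated disease activity assessments and a medication start date, pick the last assessment on or before the start and the first one after it. This is a minimal sketch with made-up dates; treating a same-day assessment as the baseline is an assumption, not something stated in the slides.

```python
from datetime import date


def activity_around_start(start, activity_timeline):
    """Return (baseline, follow_up): the last assessment on or before the
    medication start and the first assessment after it, or None if missing."""
    before = [(d, level) for d, level in activity_timeline if d <= start]
    after = [(d, level) for d, level in activity_timeline if d > start]
    baseline = max(before, key=lambda item: item[0], default=None)
    follow_up = min(after, key=lambda item: item[0], default=None)
    return baseline, follow_up


# Hypothetical patient-level disease activity timeline.
timeline = [
    (date(2009, 1, 10), "high"),
    (date(2009, 4, 2), "high"),
    (date(2009, 7, 15), "moderate"),
    (date(2009, 10, 1), "low"),
]
anti_tnf_start = date(2009, 5, 1)
baseline, follow_up = activity_around_start(anti_tnf_start, timeline)
print(baseline[1], "->", follow_up[1])  # high -> moderate
```

Comparing the baseline and follow-up levels is the simplest possible response measure; a real analysis would also need a window for how close to the start each assessment must fall.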
Questions?