Expert-curated features - Journal of the American Medical

Supplementary Material
Sheng Yu1,2,3,*, Katherine P. Liao2,3, Stanley Y. Shaw4, Vivian S. Gainer5, Susanne E.
Churchill5, Peter Szolovits6, Shawn N. Murphy4,5, Isaac Kohane3,7, and Tianxi Cai8
1Partners HealthCare Personalized Medicine, Boston, MA; 2Brigham and Women’s
Hospital, Boston, MA; 3Harvard Medical School, Boston, MA; 4Massachusetts
General Hospital, Boston, MA; 5Research Computing, Partners HealthCare,
Charlestown, MA; 6Massachusetts Institute of Technology, Cambridge, MA; 7Boston
Children’s Hospital, Boston, MA; 8Harvard School of Public Health, Boston, MA
* Corresponding author: Phone: (617) 800-6852, Email address: syu7@partners.org
Formulation of the concept mapping problem
The detected terms can usually map to multiple concepts in UMLS, i.e., there are multiple
possible meanings for a term. For example, “RA” can correspond to Radium (UMLS
C0034625), Rheumatoid Arthritis (C0003873), Radiography (C0034571), and Refractory
Anemia (C0002893). (We use quotation marks to denote terms, such as “RA”, and capitalized names for
concepts, such as Rheumatoid Arthritis.) To disambiguate the term senses, observe that the
same concept can appear in the article in the form of different terms, i.e., multiple terms can
share a common concept. Thus, we search for a minimum set of concepts that cover all the
identified terms and select them as the intended concepts. For example, if “rheumatoid arthritis”
also appears in the article, which can map to Rheumatoid Arthritis (Disease or Syndrome)
(C0003873) and Rheumatoid Arthritis (Gene or Genome) (C2697391), we would then select
Rheumatoid Arthritis (Disease or Syndrome) (C0003873), because it covers both “rheumatoid
arthritis” and “RA”. Similarly, in cases where the UMLS concept is an umbrella of multiple
more granular but synonymous concepts, the minimum covering set solution would select the
umbrella concept. For instance, the concept C0010054 includes the terms “coronary artery
disease”, “myocardial ischemia”, and “coronary atherosclerosis”, each of which represents a
more granular concept (C1956346, C0151744, and C2733225, respectively); these three
sub-concepts are clinically essentially synonymous and are automatically grouped together (via
the umbrella concept) to reduce the model dimension.
The following is a mathematical formulation of the concept mapping solution. When a term is
detected from the article, it is associated with a list of concepts that are possible meanings of it.
Denote the detected terms as $T_1, \dots, T_N$, and the union of their associated concepts as
$C_1, \dots, C_M$. Use $a_{ij} = 0$ or $1$ to represent whether $T_i$ is a term of $C_j$. The goal is to use the
minimum number of concepts to cover all the terms. Use $x_j = 0$ or $1$, $j = 1, \dots, M$, to denote
whether we select concept $C_j$; we then want to solve the following optimization problem.
$$
\min_{\{x_j\}_{j=1}^{M}} \ \sum_{j=1}^{M} x_j
$$
subject to
$$
\sum_{j=1}^{M} a_{ij} x_j \ge 1 \quad \text{for } i = 1, \dots, N,
\qquad
x_j = 0 \text{ or } 1 \quad \text{for } j = 1, \dots, M.
$$
The above formulation is a binary program and it can be relaxed to the following linear program
that can be solved with the simplex algorithm with guaranteed binary optimal solution:
$$
\min_{\{x_j\}_{j=1}^{M}} \ \sum_{j=1}^{M} x_j
$$
subject to
$$
\sum_{j=1}^{M} a_{ij} x_j \ge 1 \quad \text{for } i = 1, \dots, N,
\qquad
x_j \ge 0 \quad \text{for } j = 1, \dots, M.
$$
Alternatively, one can use a greedy algorithm that selects from the concepts that cover the most
detected terms, which is what we did in the paper. We did not prove whether this greedy
algorithm guarantees optimality.
The optimal solution is usually not unique. For example, a term is often associated
with multiple concepts, none of which is shared with other detected terms. In such cases, we
chose the concept that has the most occurrences across all the source vocabularies of UMLS.
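The greedy selection and the tie-breaking rule can be sketched as follows. This is a minimal illustration rather than the program used in the paper: the function name, the dictionary layout, and the `hypothetical_counts` values are assumptions made for the example, and the final fallback to CUI order is added only to make the output deterministic.

```python
from typing import Dict, List, Optional, Set


def greedy_concept_cover(
    term_to_concepts: Dict[str, Set[str]],
    source_counts: Optional[Dict[str, int]] = None,
) -> List[str]:
    """Greedily pick concepts until every detected term is covered.

    source_counts (hypothetical input) gives, per CUI, how often the concept
    occurs across UMLS source vocabularies; it is used only to break ties.
    """
    counts = source_counts or {}
    # Invert the mapping: concept -> set of terms it could explain.
    concept_to_terms: Dict[str, Set[str]] = {}
    for term, concepts in term_to_concepts.items():
        for cui in concepts:
            concept_to_terms.setdefault(cui, set()).add(term)

    # Terms with no candidate concept cannot be covered and are skipped.
    uncovered = {t for t, cs in term_to_concepts.items() if cs}
    selected: List[str] = []
    while uncovered:
        def score(cui: str):
            # Primary key: number of still-uncovered terms the concept covers.
            # Secondary key: source-vocabulary occurrence count.
            return (len(concept_to_terms[cui] & uncovered), counts.get(cui, 0))

        # max() returns the first maximal element, so iterating in sorted
        # order makes remaining ties fall back to the smallest CUI.
        best = max(sorted(concept_to_terms), key=score)
        selected.append(best)
        uncovered -= concept_to_terms[best]
    return selected


# Example from the text: "RA" and "rheumatoid arthritis" share C0003873.
article_terms = {
    "RA": {"C0034625", "C0003873", "C0034571", "C0002893"},
    "rheumatoid arthritis": {"C0003873", "C2697391"},
}
print(greedy_concept_cover(article_terms))

# If only "RA" were detected, every candidate covers one term, so the
# (made-up) occurrence counts decide.
hypothetical_counts = {"C0003873": 30, "C0034625": 12, "C0034571": 9, "C0002893": 7}
ra_only = {"RA": article_terms["RA"]}
print(greedy_concept_cover(ra_only, hypothetical_counts))
```

As the text notes, a greedy pass of this kind is not shown to return a minimum cover in general; it only guarantees that every term with at least one candidate concept ends up covered.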
Selection of UMLS semantic types for the tests
Sign or Symptom; Finding; Pathologic Function; Biologic Function; Phenomenon or Process;
Acquired Abnormality; Congenital Abnormality; Anatomical Abnormality; Disease or
Syndrome; Neoplastic Process; Injury or Poisoning; Mental or Behavioral Dysfunction;
Diagnostic Procedure; Therapeutic or Preventive Procedure; Amino Acid, Peptide, or Protein;
Antibiotic; Biologically Active Substance; Biomedical or Dental Material; Carbohydrate;
Chemical; Chemical Viewed Functionally; Chemical Viewed Structurally; Clinical Drug;
Eicosanoid; Element, Ion, or Isotope; Enzyme; Hazardous or Poisonous Substance; Hormone;
Immunologic Factor; Indicator, Reagent, or Diagnostic Aid; Inorganic Chemical; Lipid;
Neuroreactive Substance or Biogenic Amine; Nucleic Acid, Nucleoside, or Nucleotide;
Organic Chemical; Organophosphorus Compound; Pharmacologic Substance; Receptor;
Steroid; Vitamin; Laboratory or Test Result; Laboratory Procedure; Individual Behavior;
Medical Device.
Rule-based cleaning
After the concept mapping, the program removes terms that are not informative and terms
whose mappings are not reliable to reduce the chance of false detection. Specifically, terms of
the following types are removed from the output.

General exclusion. Some UMLS concepts are overly generic and thus not informative in
classifying diseases. For example, C0037088 Signs and Symptoms, C0332293 Treated,
C0087111 Therapy, etc., are too non-specific to be useful for classification. A list of such
concepts has been compiled for the program, and terms whose mapped concepts are in
the list are excluded.

Brand name drugs. Since a lot of brand names coincide with common words (e.g.,
C0310367 Today, an antibiotic), the terms that are detected as brand names at this stage
are likely to be false detections. It is safe to exclude all brand names at this stage, because
when encyclopedia articles mention a brand name, the generic name is most likely
mentioned as well. The program will recover all brand name drugs via the corresponding
generic name at a later stage.

“Unpreferred” terms. Sometimes, an incorrect mapping occurs because the terms recorded
in UMLS are not ideal. For example, C0795691 Heart Problem has a term “heart”
(A0418118), and C0340708 Deep Vein Thrombosis of Lower Extremity has a term “deep
vein thrombosis” (A18637666). To filter out these incorrect mappings, the program checks
the TTY attribute in UMLS for each term and its mapped concept, which indicates how
the term is recorded in the source vocabularies: as a preferred term of the concept, a
synonym, a permutation of words, etc. The program checks different source vocabularies
for different semantic types, for example, RxNorm and NDF-RT for chemicals and drugs.

Abbreviations. Abbreviations are the most difficult terms to disambiguate[1–4]. The
program excludes all abbreviations at this stage, and it is safe to do so: If the program
removes a term whose mapping is wrong, it is beneficial; even if the mapping is correct,
no information would be lost by removing an abbreviation of an important concept, since
the full name of the concept is generally introduced prior to the abbreviation and will be
identified. For instance, excluding “RA” is fine, because the concept C0003873 that it
represents is kept via the full name “rheumatoid arthritis”.
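The four exclusion rules above can be pictured as a single filtering pass over the mapped terms. The sketch below is illustrative only: the `Mapping` record and its boolean flags are hypothetical stand-ins for information that the program derives from UMLS (for example, from the TTY attribute), and the exclusion list contains just the three generic concepts named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

# Only the three examples given in the text; the real list is longer.
GENERIC_CUIS = {"C0037088", "C0332293", "C0087111"}


@dataclass
class Mapping:
    """A detected term and its mapped concept (hypothetical record layout)."""
    term: str
    cui: str
    is_brand_name: bool = False    # assumed to be known from the drug vocabularies
    is_preferred: bool = True      # assumed to be derived from the TTY attribute
    is_abbreviation: bool = False  # assumed to be flagged upstream


def removal_reason(m: Mapping) -> Optional[str]:
    """Return the name of the first rule that excludes m, or None to keep it."""
    if m.cui in GENERIC_CUIS:
        return "general exclusion"
    if m.is_brand_name:
        return "brand name drug"
    if not m.is_preferred:
        return "unpreferred term"
    if m.is_abbreviation:
        return "abbreviation"
    return None


def clean(mappings: List[Mapping]) -> List[Mapping]:
    """Keep only the mappings that no rule excludes."""
    return [m for m in mappings if removal_reason(m) is None]


detected = [
    Mapping("therapy", "C0087111"),
    Mapping("today", "C0310367", is_brand_name=True),
    Mapping("heart", "C0795691", is_preferred=False),
    Mapping("RA", "C0003873", is_abbreviation=True),
    Mapping("rheumatoid arthritis", "C0003873"),
]
kept_terms = [m.term for m in clean(detected)]
print(kept_terms)
```

In this toy input the concept C0003873 survives through the full name “rheumatoid arthritis” even though the abbreviation “RA” is dropped, which is the behavior the abbreviation rule relies on.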
More discussion on drug grouping
There is no universal agreement among terminologies about the correct hierarchy of these
classes, so a drug can appear at more than one place in the grouping. For example, Figure 1
shows C0004057 Aspirin both directly under C0003211 Non-steroidal Anti-inflammatory Agents
and under C0036077 Salicylates, which in turn is under C0003211. Drugs may also appear under
classes not specified to be related to each other,
such as C0025677 Methotrexate, which is cross-listed in C0016411 Folic Acid Antagonists,
C0003191 Antirheumatic Agents, and C0021081 Immunosuppressive Agents. Using the
retrieved relations, AFEP may also suggest classes not mentioned in the knowledge sources to
group ungrouped concepts, e.g., “Consider adding class C0282651 Selectins to group C0115305
E-Selectin and C0134835 P-Selectin”, since Selectins is a common class for both E-Selectin and
P-Selectin. AFEP also drops concepts that are dose forms, which it detects by checking the
concept names with regular expressions, e.g., “C2710124 Prasugrel 5 mg is a dose form and is
discarded”. All of the generic drugs and drug classes are retained as candidate features.
Subsequent feature selection steps will decide which ones to use.
C0003211 anti-inflammatory agents, non-steroidal
└ C0053959 boswellic acid
└ C0010467 curcumin
└ C1257954 cyclooxygenase 2 inhibitors
└ C0538927 celecoxib
└ C0031990 piroxicam
└ C0022635 ketoprofen
└ C0027396 naproxen
└ C0036077 salicylates
└ C0036078 sulfasalazine
└ C0004057 aspirin
└ C0012091 diclofenac
└ C0358504 diclofenac topical products
└ C1252196 diclofenac topical gel
└ C0282131 diclofenac potassium
└ C0700583 diclofenac sodium
└ C0020740 ibuprofen
└ C0004057 aspirin
Figure 1: Example drug grouping result from AFEP (brand names are not shown)
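The two auxiliary checks described in this section, discarding dose forms and suggesting a common class for ungrouped concepts, could look roughly like the sketch below. The regular expression, the function names, and the `parents` dictionary are assumptions made for illustration; the paper does not give the actual patterns or data structures used by AFEP.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumed pattern: a number followed by a common strength unit, e.g. "5 mg".
DOSE_FORM_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg|g|ml)\b", re.IGNORECASE)


def is_dose_form(concept_name: str) -> bool:
    """Guess whether a concept name describes a specific dose form."""
    return DOSE_FORM_RE.search(concept_name) is not None


def suggest_classes(
    parents: Dict[str, Set[str]], ungrouped: Set[str]
) -> Dict[str, List[str]]:
    """Suggest classes that are a parent of at least two ungrouped concepts.

    parents maps a concept CUI to the CUIs of its classes (hypothetical input
    standing in for the relations retrieved from UMLS).
    """
    members: Dict[str, List[str]] = defaultdict(list)
    for cui in sorted(ungrouped):
        for cls in sorted(parents.get(cui, ())):
            members[cls].append(cui)
    return {cls: cuis for cls, cuis in members.items() if len(cuis) >= 2}


# Examples taken from the messages quoted in the text.
print(is_dose_form("Prasugrel 5 mg"))  # matches the assumed pattern
print(is_dose_form("Prasugrel"))       # no strength, so it is kept

selectin_parents = {
    "C0115305": {"C0282651"},  # E-Selectin -> Selectins
    "C0134835": {"C0282651"},  # P-Selectin -> Selectins
}
suggestions = suggest_classes(selectin_parents, {"C0115305", "C0134835"})
print(suggestions)
```

A pattern this simple would miss dose forms that are expressed without a numeric strength, so it should be read as a starting point rather than a complete rule.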
Expert-curated features
Expert-curated Features for RA
Age, Sex, ICD-9 RA, ICD-9 psoriatic arthritis (PsA), ICD-9 systemic lupus erythematosus
(SLE), ICD-9 juvenile rheumatoid arthritis (JRA), ICD-9 RA normalized, Codified
methotrexate, Codified anti-TNF, Codified other disease-modifying antirheumatic drugs
(DMARD), Codified rheumatoid factor (RF) negative, Codified RF positive, NLP RA, NLP
SLE, NLP JRA, NLP PsA, NLP methotrexate, NLP anti-TNF, NLP other DMARD, NLP
anti-cyclic citrullinated peptide positive, NLP RF, NLP seropositive, NLP erosions.
ICD-9 RA normalized = ln (no. of ICD-9 RA codes per subject >= 1 week apart).
Codified anti-TNF = etanercept and infliximab (adalimumab was not available in our EMR).
NLP anti-TNF = adalimumab, etanercept, and infliximab.
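One possible reading of the “ICD-9 RA normalized” definition is sketched below. It assumes that codes are scanned in date order and that a code is counted only if it falls at least seven days after the previously counted one; the source does not spell out the counting rule or how subjects with zero codes are handled, so both choices here (including returning `None` for an empty list) are assumptions.

```python
import math
from datetime import date
from typing import Iterable, Optional


def count_codes_week_apart(code_dates: Iterable[date], min_gap_days: int = 7) -> int:
    """Count codes such that consecutive counted codes are >= min_gap_days apart."""
    count = 0
    last_counted = None
    for d in sorted(code_dates):
        if last_counted is None or (d - last_counted).days >= min_gap_days:
            count += 1
            last_counted = d
    return count


def icd9_ra_normalized(code_dates: Iterable[date]) -> Optional[float]:
    """Natural log of the spaced code count; None when there are no codes."""
    n = count_codes_week_apart(code_dates)
    return math.log(n) if n > 0 else None


# Made-up dates for one subject: the code on 3 January falls within a week
# of the code on 1 January and is therefore not counted.
subject_dates = [date(2010, 1, 1), date(2010, 1, 3), date(2010, 1, 8), date(2010, 1, 20)]
print(count_codes_week_apart(subject_dates))
print(icd9_ra_normalized(subject_dates))
```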
Expert-curated Features for CAD
Age, Sex, Race, ICD-9 CAD, ICD-9 ischemic heart disease (IHD), ICD-9 angina, Inpatient
ICD-9 CAD, Inpatient ICD-9 IHD, Inpatient ICD-9 angina, Codified statin, Codified aspirin,
Codified beta blockers, Codified ACE inhibitors, Codified Plavix, NLP CAD, NLP positive
stress test, NLP CAD procedures, Codified insulin, ICD-9 hypertension, ICD-9 diabetes,
ICD-9 hyperlipidemia, NLP BMI value, NLP diabetes, NLP family history of CAD, NLP current
smoker, NLP past smoker, NLP never smoker, EHR follow-up time (months), Codified
echocardiogram performed, Codified mean low-density lipoprotein, NLP positive troponin,
Total number of ICD-9 codes, Codified coronary artery bypass graft surgery and percutaneous
coronary intervention.
Model coefficients
Feature                  Coefficient
RA.NLP                   1.037
RA.ICD                   0.655
MORNING STIFFNESS        0.462
ACUTE PHASE PROTEINS     0.008
DELAYED RELEASE          -0.124
MODIFIED RELEASE         -0.156
NOTE_COUNT               -0.527
Figure 2: Nonzero coefficients for rheumatoid arthritis
Feature                  Coefficient
CAD.NLP                  0.886
CAD.ICD                  0.862
PTCA                     0.255
CORONARY ARTERY BYPASS   0.241
MYOCARDIAL INFARCTION    0.238
LIPID LOWERING AGENTS    0.169
NITROGLYCERIN            0.081
CHRONIC KIDNEY DISEASE   0.068
ANGIOPLASTY              0.056
CATHETERIZATION          0.043
ATORVASTATIN             0.039
BETA BLOCKERS            0.017
ASPIRIN                  0.017
CLOPIDOGREL              0.008
ANTI ARRHYTHMIC          0.007
ANTIPLATELET AGENTS      0.003
DIABETES MELLITUS        -0.005
POTASSIUM                -0.048
CALCIUM                  -0.049
LISINOPRIL               -0.060
CHOLESTEROL LEVELS       -0.060
HYPERLIPIDEMIA           -0.112
ELECTROCARDIOGRAM        -0.130
EMERGENCY                -0.153
NOTE_COUNT               -0.180
TOBACCO                  -0.228
OXYGEN                   -0.268
Figure 3: Nonzero coefficients for coronary artery disease
False detections in NLP
We noted instances of false detection in NLP. C1707664 Delayed Release and C1709058
Modified Release in the RA model were false detections. C1707664 Delayed Release in UMLS
contained a term “dr”, which means “doctor” or “drive” (as in street address) in most clinical
notes. Since C1709058 Modified Release includes C1707664 Delayed Release, occurrences of
the latter were also counted toward the former. Thus the detections of Delayed Release and
Modified Release were mostly false, and their counts closely tracked the note count. In fact,
removing the two features manually and refitting the model gave the same AUC, and the
coefficient of note count became -0.895, confirming that delayed release and modified release
can be entirely replaced by note count. Unfortunately the concept screening in Section 2.4 was
not able to remove them, because Rule 2 was applied only to non-chemical concepts. In the
CAD model, a portion of mentions of oxygen were false detections, because C0030054 Oxygen
contained a term “o”, which was also the bullet symbol in our hospital's EHR. All the false
detections above are left in the data that produced the results in the paper.
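A simple diagnostic in the spirit of the observation above is to flag features whose per-patient counts are almost perfectly correlated with the note count. The sketch below is not part of the method in the paper: the function names, the 0.95 threshold, and the toy counts are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def flag_note_count_proxies(
    features: Dict[str, Sequence[float]],
    note_count: Sequence[float],
    threshold: float = 0.95,  # assumed cutoff, not taken from the paper
) -> List[str]:
    """Names of features whose counts track the note count almost exactly."""
    return sorted(
        name
        for name, values in features.items()
        if abs(pearson(values, note_count)) >= threshold
    )


# Made-up counts for five patients.
toy_note_count = [10, 50, 120, 300, 15]
toy_features = {
    "DELAYED RELEASE": [4, 22, 50, 131, 6],   # grows with the number of notes
    "MORNING STIFFNESS": [0, 3, 0, 1, 2],     # unrelated to the number of notes
}
flagged = flag_note_count_proxies(toy_features, toy_note_count)
print(flagged)
```

A flagged feature is only a candidate for manual review, since a genuine clinical concept can also be mentioned more often in patients with more notes.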
References
1 Wu Y, Denny JC, Rosenbloom ST, et al. A comparative study of current clinical natural
language processing systems on handling abbreviations in discharge summaries. AMIA Annu
Symp Proc 2012;2012:997.
2 Kuhn IF. Abbreviations and acronyms in healthcare: when shorter isn’t sweeter. Pediatr
Nurs 2007;33:392–8.
3 Sheppard JE, Weidner LCE, Zakai S, et al. Ambiguous abbreviations: an audit of
abbreviations in paediatric note keeping. Arch Dis Child 2008;93:204–6.
doi:10.1136/adc.2007.128132
4 Xu H, Stetson PD, Friedman C. A Study of Abbreviations in Clinical Notes. AMIA Annu
Symp Proc 2007;2007:821.