Supplementary Material

Sheng Yu1,2,3,*, Katherine P. Liao2,3, Stanley Y. Shaw4, Vivian S. Gainer5, Susanne E. Churchill5, Peter Szolovits6, Shawn N. Murphy4,5, Isaac Kohane3,7, and Tianxi Cai8

1Partners HealthCare Personalized Medicine, Boston, MA; 2Brigham and Women’s Hospital, Boston, MA; 3Harvard Medical School, Boston, MA; 4Massachusetts General Hospital, Boston, MA; 5Research Computing, Partners HealthCare, Charlestown, MA; 6Massachusetts Institute of Technology, Cambridge, MA; 7Boston Children’s Hospital, Boston, MA; 8Harvard School of Public Health, Boston, MA

* Corresponding author: Phone: (617) 800-6852, Email address: syu7@partners.org

Formulation of the concept mapping problem

The detected terms can usually map to multiple concepts in UMLS, i.e., a term can have multiple possible meanings. For example, “RA” can correspond to Radium (UMLS C0034625), Rheumatoid Arthritis (C0003873), Radiography (C0034571), and Refractory Anemia (C0002893). (We use quotation marks to denote terms, such as “RA”, and capital letters for concepts, such as Rheumatoid Arthritis.) To disambiguate the term senses, observe that the same concept can appear in the article in the form of different terms, i.e., multiple terms can share a common concept. Thus, we search for a minimum set of concepts that covers all the identified terms and select them as the intended concepts. For example, if “rheumatoid arthritis” also appears in the article, which can map to Rheumatoid Arthritis (Disease or Syndrome) (C0003873) and Rheumatoid Arthritis (Gene or Genome) (C2697391), we would then select Rheumatoid Arthritis (Disease or Syndrome) (C0003873), because it covers both “rheumatoid arthritis” and “RA”. Similarly, in cases where the UMLS concept is an umbrella of multiple more granular but synonymous concepts, the minimum covering set solution would select the umbrella concept.
For instance, the concept C0010054 includes the terms “coronary artery disease”, “myocardial ischemia”, and “coronary atherosclerosis”, each of which represents a more granular concept (C1956346, C0151744, and C2733225, respectively); these three subconcepts are clinically essentially synonymous and are automatically grouped together (via the umbrella concept) to reduce the model dimension.

The following is a mathematical formulation of the concept mapping solution. When a term is detected in the article, it is associated with a list of concepts that are its possible meanings. Denote the detected terms as $T_1, \dots, T_N$ and the union of their associated concepts as $C_1, \dots, C_M$. Let $a_{ij} = 0$ or $1$ represent whether $T_i$ is a term of $C_j$. The goal is to use the minimum number of concepts to cover all the terms. Let $x_j = 0$ or $1$, $j = 1, \dots, M$, denote whether we select concept $C_j$; we then want to solve the following optimization problem:

$$
\min_{\{x_j\}_{j=1}^{M}} \sum_{j=1}^{M} x_j
\quad \text{subject to} \quad
\sum_{j=1}^{M} a_{ij} x_j \ge 1 \ \text{ for } i = 1, \dots, N,
\qquad x_j = 0 \text{ or } 1 \ \text{ for } j = 1, \dots, M.
$$

The above formulation is a binary program, and it can be relaxed to the following linear program, which can be solved with the simplex algorithm with a guaranteed binary optimal solution:

$$
\min_{\{x_j\}_{j=1}^{M}} \sum_{j=1}^{M} x_j
\quad \text{subject to} \quad
\sum_{j=1}^{M} a_{ij} x_j \ge 1 \ \text{ for } i = 1, \dots, N,
\qquad x_j \ge 0 \ \text{ for } j = 1, \dots, M.
$$

Alternatively, one can use a greedy algorithm that repeatedly selects the concept that covers the most detected terms, which is what we did in the paper. We did not prove whether this greedy algorithm guarantees optimality.

The optimal solution is usually not unique. For example, quite often a term is associated with multiple concepts, none of which is shared with other detected terms. In such cases, we chose the concept that has the most occurrences across all the source vocabularies of UMLS.
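The greedy alternative and the tie-breaking rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `greedy_concept_cover`, the `term_to_concepts` mapping, and the `vocab_count` table (a hypothetical stand-in for the number of occurrences of a concept across the UMLS source vocabularies) are introduced only for this example. The term-to-concept candidates are the ones quoted in the “RA” example.

```python
from typing import Dict, List, Set


def greedy_concept_cover(
    term_to_concepts: Dict[str, Set[str]],
    vocab_count: Dict[str, int],
) -> List[str]:
    """Greedily pick concepts until every detected term is covered.

    At each step the concept covering the most still-uncovered terms is
    chosen. Ties go to the concept with the larger vocab_count (hypothetical
    stand-in for occurrences across UMLS source vocabularies), then to the
    smaller CUI so that the result is deterministic.
    """
    # Invert the mapping: concept -> terms it can cover (the a_ij = 1 entries).
    concept_to_terms: Dict[str, Set[str]] = {}
    for term, concepts in term_to_concepts.items():
        for cui in concepts:
            concept_to_terms.setdefault(cui, set()).add(term)

    # Only terms with at least one candidate concept can be covered.
    uncovered = {t for t, cs in term_to_concepts.items() if cs}
    selected: List[str] = []
    while uncovered:
        best = min(
            concept_to_terms,
            key=lambda c: (
                -len(concept_to_terms[c] & uncovered),  # most new terms first
                -vocab_count.get(c, 0),                 # then most occurrences
                c,                                      # then CUI order
            ),
        )
        selected.append(best)
        uncovered -= concept_to_terms[best]
    return selected


# Candidate concepts for the two terms discussed in the text.
terms = {
    "RA": {"C0034625", "C0003873", "C0034571", "C0002893"},
    "rheumatoid arthritis": {"C0003873", "C2697391"},
}
selected = greedy_concept_cover(terms, vocab_count={})
print(selected)  # ['C0003873']
```

Here C0003873 is chosen because it is the only concept that covers both terms. The greedy rule is a heuristic for the binary program above and, as noted, is not shown to be optimal in general.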
Selection of UMLS semantic types for the tests

Sign or Symptom; Finding; Pathologic Function; Biologic Function; Phenomenon or Process; Acquired Abnormality; Congenital Abnormality; Anatomical Abnormality; Disease or Syndrome; Neoplastic Process; Injury or Poisoning; Mental or Behavioral Dysfunction; Diagnostic Procedure; Therapeutic or Preventive Procedure; Amino Acid, Peptide, or Protein; Antibiotic; Biologically Active Substance; Biomedical or Dental Material; Carbohydrate; Chemical; Chemical Viewed Functionally; Chemical Viewed Structurally; Clinical Drug; Eicosanoid; Element, Ion, or Isotope; Enzyme; Hazardous or Poisonous Substance; Hormone; Immunologic Factor; Indicator, Reagent, or Diagnostic Aid; Inorganic Chemical; Lipid; Neuroreactive Substance or Biogenic Amine; Nucleic Acid, Nucleoside, or Nucleotide; Organic Chemical; Organophosphorus Compound; Pharmacologic Substance; Receptor; Steroid; Vitamin; Laboratory or Test Result; Laboratory Procedure; Individual Behavior; Medical Device.

Rule-based cleaning

After the concept mapping, the program removes terms that are not informative and terms whose mappings are not reliable, to reduce the chance of false detection. Specifically, terms of the following types are removed from the output.

General exclusion. Some UMLS concepts are overly generic and thus not informative in classifying diseases. For example, C0037088 Signs and Symptoms, C0332293 Treated, and C0087111 Therapy are too nonspecific to be useful for classification. A list of such concepts has been compiled for the program, and terms whose mapped concepts are in the list are excluded.

Brand name drugs. Since many brand names coincide with common words (e.g., C0310367 Today, an antibiotic), the terms that are detected as brand names at this stage are likely to be false detections. It is safe to exclude all brand names at this stage, because when encyclopedia articles mention a brand name, the generic name is most likely mentioned as well.
The program will recover all brand name drugs via the corresponding generic name at a later stage.

“Unpreferred” terms. Sometimes, an incorrect mapping occurs because the terms recorded in UMLS are not ideal. For example, C0795691 Heart Problem has a term “heart” (A0418118), and C0340708 Deep Vein Thrombosis of Lower Extremity has a term “deep vein thrombosis” (A18637666). To filter out these incorrect mappings, the program checks the TTY attribute in UMLS for each term and its mapped concept, which indicates how the term is recorded in the source vocabularies: as a preferred term of the concept, a synonym, a permutation of words, etc. The program checks different source vocabularies for different semantic types, for example, RxNorm and NDF-RT for chemicals and drugs.

Abbreviations. Abbreviations are the most difficult terms to disambiguate[1–4]. The program excludes all abbreviations at this stage, and it is safe to do so: if the program removes a term whose mapping is wrong, the removal is beneficial; even if the mapping is correct, no information would be lost by removing an abbreviation of an important concept, since the full name of the concept is generally introduced prior to the abbreviation and will be identified. For instance, excluding “RA” is fine, because the concept C0003873 that it represents is kept via the full name “rheumatoid arthritis”.

More discussion on drug grouping

Since there is no universal agreement among terminologies about the correct hierarchy of these classes, a drug may appear in more than one place in the grouping. Figure 1 shows that a drug such as Aspirin may appear both directly under C0003211 Non-steroidal Anti-inflammatory Agents and under C0036077 Salicylates, which in turn is under C0003211. Drugs may also appear under classes not specified to be related to each other, such as C0025677 Methotrexate, which is cross-listed in C0016411 Folic Acid Antagonists, C0003191 Antirheumatic Agents, and C0021081 Immunosuppressive Agents.
Using the retrieved relations, AFEP may also suggest classes not mentioned in the knowledge sources to group ungrouped concepts, e.g., “Consider adding class C0282651 Selectins to group C0115305 E-Selectin and C0134835 P-Selectin”, since Selectins is a common class for both E-Selectin and P-Selectin. AFEP also drops concepts that are dose forms by checking the concept names with regular expressions, e.g., “C2710124 Prasugrel 5 mg is a dose form and is discarded”. All of the generic drugs and drug classes are retained as candidate features. Subsequent feature selection steps will decide which ones to use.

C0003211 anti-inflammatory agents, non-steroidal
└ C0053959 boswellic acid
└ C0010467 curcumin
└ C1257954 cyclooxygenase 2 inhibitors
    └ C0538927 celecoxib
└ C0031990 piroxicam
└ C0022635 ketoprofen
└ C0027396 naproxen
└ C0036077 salicylates
    └ C0036078 sulfasalazine
    └ C0004057 aspirin
└ C0012091 diclofenac
    └ C0358504 diclofenac topical products
        └ C1252196 diclofenac topical gel
    └ C0282131 diclofenac potassium
    └ C0700583 diclofenac sodium
└ C0020740 ibuprofen
└ C0004057 aspirin

Figure 1: Example drug grouping result from AFEP (brand names are not shown)

Expert-curated features

Expert-curated Features for RA

Age, Sex, ICD-9 RA, ICD-9 psoriatic arthritis (PsA), ICD-9 systemic lupus erythematosus (SLE), ICD-9 juvenile rheumatoid arthritis (JRA), ICD-9 RA normalized, Codified methotrexate, Codified anti-TNF, Codified other disease-modifying antirheumatic drugs (DMARD), Codified rheumatoid factor (RF) negative, Codified RF positive, NLP RA, NLP SLE, NLP JRA, NLP PsA, NLP methotrexate, NLP anti-TNF, NLP other DMARD, NLP anti-cyclic citrullinated peptide positive, NLP RF, NLP seropositive, NLP erosions.

ICD-9 RA normalized = ln (no. of ICD-9 RA codes per subject >= 1 week apart).
Codified anti-TNF = etanercept and infliximab (adalimumab was not available in our EMR).
NLP anti-TNF = adalimumab, etanercept, and infliximab.
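To make the definition of ICD-9 RA normalized concrete, the sketch below counts a subject's RA codes that are at least one week apart and takes the natural logarithm. It reflects one possible reading of the definition and is not the authors' code: the function name `icd9_normalized`, the left-to-right counting rule (a code is counted when it falls at least 7 days after the previously counted code), and returning `None` for a subject with no codes (where the logarithm is undefined) are all assumptions. The dates are made up for illustration.

```python
import math
from datetime import date, timedelta
from typing import Iterable, Optional


def icd9_normalized(
    code_dates: Iterable[date], min_gap_days: int = 7
) -> Optional[float]:
    """ln of the number of codes at least min_gap_days apart (assumed reading)."""
    count = 0
    last_counted: Optional[date] = None
    for d in sorted(code_dates):
        # Count a code only if it is far enough from the last counted one.
        if last_counted is None or d - last_counted >= timedelta(days=min_gap_days):
            count += 1
            last_counted = d
    # ln(0) is undefined; returning None here is an assumption of this sketch.
    return math.log(count) if count > 0 else None


# Made-up code dates for one subject: Jan 3 is within a week of Jan 1,
# so only Jan 1, Jan 8, and Jan 20 are counted.
example_dates = [date(2010, 1, 1), date(2010, 1, 3), date(2010, 1, 8), date(2010, 1, 20)]
value = icd9_normalized(example_dates)
print(value)  # ln(3)
```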
Expert-curated Features for CAD

Age, Sex, Race, ICD-9 CAD, ICD-9 ischemic heart disease (IHD), ICD-9 angina, Inpatient ICD-9 CAD, Inpatient ICD-9 IHD, Inpatient ICD-9 angina, Codified statin, Codified aspirin, Codified beta blockers, Codified ACE inhibitors, Codified Plavix, NLP CAD, NLP positive stress test, NLP CAD procedures, Codified insulin, ICD-9 hypertension, ICD-9 diabetes, ICD-9 hyperlipidemia, NLP BMI value, NLP diabetes, NLP family history of CAD, NLP current smoker, NLP past smoker, NLP never smoker, EHR follow-up time (months), Codified echocardiogram performed, Codified mean low-density lipoprotein, NLP positive troponin, Total number of ICD-9 codes, Codified coronary artery bypass graft surgery and percutaneous coronary intervention.

Model coefficients

Feature                   Coefficient
RA.NLP                     1.037
RA.ICD                     0.655
MORNING STIFFNESS          0.462
ACUTE PHASE PROTEINS       0.008
DELAYED RELEASE           -0.124
MODIFIED RELEASE          -0.156
NOTE_COUNT                -0.527

Figure 2: Nonzero coefficients for rheumatoid arthritis

Feature                   Coefficient
CAD.NLP                    0.886
CAD.ICD                    0.862
PTCA                       0.255
CORONARY ARTERY BYPASS     0.241
MYOCARDIAL INFARCTION      0.238
LIPID LOWERING AGENTS      0.169
NITROGLYCERIN              0.081
CHRONIC KIDNEY DISEASE     0.068
ANGIOPLASTY                0.056
CATHETERIZATION            0.043
ATORVASTATIN               0.039
BETA BLOCKERS              0.017
ASPIRIN                    0.017
CLOPIDOGREL                0.008
ANTI ARRHYTHMIC            0.007
ANTIPLATELET AGENTS        0.003
DIABETES MELLITUS         -0.005
POTASSIUM                 -0.048
CALCIUM                   -0.049
LISINOPRIL                -0.060
CHOLESTEROL LEVELS        -0.060
HYPERLIPIDEMIA            -0.112
ELECTROCARDIOGRAM         -0.130
EMERGENCY                 -0.153
NOTE_COUNT                -0.180
TOBACCO                   -0.228
OXYGEN                    -0.268

Figure 3: Nonzero coefficients for coronary artery disease

False detections in NLP

We noted instances of false detection in NLP. C1707664 Delayed Release and C1709058 Modified Release in the RA model were false detections. C1707664 Delayed Release in UMLS contained a term “dr”, which means “doctor” or “drive” (as in a street address) in most clinical notes.
Since C1709058 Modified Release includes C1707664 Delayed Release, occurrences of the latter were also counted toward the former. Thus the detections of Delayed Release and Modified Release were mostly false, and their data closely resembled the note count. In fact, removing the two features manually and refitting the model gave the same AUC, and the coefficient of note count became -0.895, confirming that Delayed Release and Modified Release can be entirely replaced by note count. Unfortunately, the concept screening in Section 2.4 was not able to remove them, because Rule 2 was applied only to non-chemical concepts.

In the CAD model, a portion of the mentions of oxygen were false detections, because C0030054 Oxygen contained a term “o”, which was also the bullet symbol in our hospital's EHR. All the false detections above were left in the data that produced the results in the paper.

References

1 Wu Y, Denny JC, Rosenbloom ST, et al. A comparative study of current clinical natural language processing systems on handling abbreviations in discharge summaries. AMIA Annu Symp Proc 2012;2012:997.
2 Kuhn IF. Abbreviations and acronyms in healthcare: when shorter isn’t sweeter. Pediatr Nurs 2007;33:392–8.
3 Sheppard JE, Weidner LCE, Zakai S, et al. Ambiguous abbreviations: an audit of abbreviations in paediatric note keeping. Arch Dis Child 2008;93:204–6. doi:10.1136/adc.2007.128132
4 Xu H, Stetson PD, Friedman C. A study of abbreviations in clinical notes. AMIA Annu Symp Proc 2007;2007:821.