A Comparative Study of Supervised Learning as Applied to Acronym Expansion in Clinical Reports
Mahesh Joshi, Serguei Pakhomov, Ted Pedersen, Christopher G. Chute
University of Minnesota, Duluth
Mayo College of Medicine, Rochester
AMIA-2006

Overview
• Acronyms are ambiguous
  – in general, and in more specialized domains
• Acronyms can be disambiguated by expansion
  – expansions act as senses or definitions
• Acronym expansion can be viewed as word sense disambiguation
  – supervised learning from annotated examples
• Features trump learning algorithms
  – unigrams dominant

AMIA - Top Google Results
• American Medical Informatics Association
• Association of Moving Image Archivists
• Anglican Mission in America
• Asociación Mutual Israelita Argentina

RN in Wikipedia
• Registered Nurse
• Royal Navy
• Radio National
• Radio Nederland
• Richard Nixon
• Registered Identification Number
• Renovación Nacional

Acronym Ambiguity not just a problem for General English…
• 33% of acronyms in UMLS are ambiguous
  – Liu et al., AMIA-2001
• 81% of acronyms in MEDLINE abstracts are ambiguous, with an average of 16 expansions
  – Liu et al., AMIA-2002

We view AE as WSD
• AE
  – sense 1: American Eagle
  – sense 2: Arab Emirates
  – sense 3: acronym expansion
• WSD
  – sense 1: Washington School for the Deaf
  – sense 2: web server director
  – sense 3: word sense disambiguation

Methodology
• Identify 16 ambiguous acronyms
  – 9 from Pakhomov et al., AMIA-2005
  – 7 newly annotated for this study
• Manually annotate in clinical notes
  – 7,738 total instances from Mayo Clinic database of clinical notes
• Use as training data for supervised learning

Acronyms (majority < 50%)
• AC
  – Acromioclavicular
  – Antitussive with Codeine
  – Acid Controller
  – 10 more expansions
• APC
  – Argon Plasma Coagulation
  – Adenomatous Polyposis Coli
  – Atrial Premature Contraction
  – 10 more expansions
• LE
  – Limited Exam
  – Lower Extremity
  – Initials
  – 5 more expansions
• PE
  – Pulmonary Embolism
  – Pressure Equalizing
  – Patient Education
  – 12 more expansions

Acronyms (50% < majority < 80%)
• CP
  – Chest Pain
  – Cerebral Palsy
  – Cerebellopontine
  – 19 more expansions
• HD
  – Huntington's Disease
  – Hemodialysis
  – Hospital Day
  – 9 more expansions
• CF
  – Cystic Fibrosis
  – Cold Formula
  – Complement Fixation
  – 6 more expansions
• MCI
  – Mild Cognitive Impairment
  – Methylchloroisothiazolinone
  – Microwave Communications, Inc.
  – 5 more expansions
• ID
  – Infectious Disease
  – Identification
  – Idaho
  – Identified
  – 4 more expansions
• LA
  – Long Acting
  – Person
  – Left Atrium
  – 5 more expansions

Acronyms (majority > 80%)
• MI
  – Myocardial Infarction
  – Michigan
  – Unknown
  – 2 more expansions
• ACA
  – Adenocarcinoma
  – Anterior Cerebral Artery
  – Anterior Communication Artery
  – 3 more expansions
• GE
  – Gastroesophageal
  – General Exam
  – Generose
  – General Electric
• HA
  – Headache
  – Hearing Aid
  – Hydroxyapatite
  – 2 more expansions
• FEN
  – Fluids, Electrolytes and Nutrition
  – Drug Fen Phen
  – Unknown
• NSR
  – Normal Sinus Rhythm
  – Nasoseptal Reconstruction

Experimental Objectives
• Compare performance of ML methods
  – Naïve Bayesian classifier
  – J48/C4.5 Decision Tree Learner
  – Support Vector Machine (SMO)
• Compare four different feature sets
  – POS tags from Brill-Hepple Tagger
  – Unigrams that occur 5 or more times
    • flexible window of size 5 around target
  – Bigrams that occur 5 or more times
    • flexible window of size 5 around target
  – Unigrams + Bigrams + POS Tags

Feature Extraction
• Horizon: up to 5 content words to left and right of target
• Boundaries: cross sentences, but not clinical notes
• Skip stop words
• Bigrams are pairs of contiguous content words
• Example (CF is target):
  – Unigrams: “If she is found to be a carrier, then they will follow with CF carrier testing in her husband.”
  – Bigrams: “If she is found to be a carrier, then they will follow with CF carrier testing in her husband.”

Results (majority < 50%)
Feature Comparison (AC, APC, LE, PE)
[Bar chart: accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM) for the feature sets POS, bigrams, unigrams and ALL, shown against the Majority baseline]

Results (50% < majority < 80%)
Feature Comparison (CP, HD, CF, MCI, ID, LA)
[Bar chart: accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM) for the feature sets POS, bigrams, unigrams and ALL, shown against the Majority baseline]

Results (majority > 80%)
Feature Comparison (MI, ACA, GE, HA, FEN, NSR)
[Bar chart: accuracy (%), axis from 30 to 100, by classifier (Decision Trees, Naïve Bayes, SVM) for the feature sets POS, bigrams, unigrams and ALL, shown against the Majority baseline]

Results (flexible window)
Fixed vs. Flexible Window Performance
[Line chart: accuracy (%), axis from 70 to 95, against window size (1 to 10) for fixed-unigrams, flexi-unigrams, fixed-bigrams, flexi-bigrams, fixed-unigrams+bigrams and flexi-unigrams+bigrams]

Conclusions
• Overall expansion accuracy at or above 90% regardless of distribution
• Differences in accuracy are largely due to features, not ML algorithms
• Addition of bigrams and POS tags helps performance, but unigrams are dominant
• Flexible window improves upon fixed window feature selection

Future Work
• Expand all acronyms in a text, not just a select few
  – expand based on prior expansions
  – utilize one sense per discourse constraint
• Integrate supervised methods with knowledge-based approaches and clustering methods to reduce need for annotated examples

Acknowledgments
• We would like to thank our annotators Barbara Abbott, Debra Albrecht and Pauline Funk.
• This work was supported in part by the NLM Training Grant (T15 LM07041-19) and the NIH Roadmap Multidisciplinary Clinical Research Career Development Award (K12/NICHD) HD49078.
• Dr. Pedersen has been partially supported by a National Science Foundation Faculty Early CAREER Development Award (#0092784).

Software Resources
• GATE (General Architecture for Text Engineering)
  – http://gate.ac.uk/
• NSPGate
  – http://nspgate.sourceforge.net/
• Ngram Statistics Package
  – http://ngram.sourceforge.net/
• WSDGate
  – http://wsdgate.sourceforge.net/
• WEKA (Waikato Environment for Knowledge Analysis)
  – http://www.cs.waikato.ac.nz/ml/weka/
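
Illustration: flexible vs. fixed window
The Feature Extraction slide describes the flexible window only in words, so a small Python sketch may help make the idea concrete. It is not the authors' implementation (the talk lists GATE, NSPGate and WSDGate as the actual tools). The stop list, the tokenizer and the function names are hypothetical, bigrams are assumed to be formed from adjacent content words on each side of the target after stop words are skipped, and the "5 or more occurrences" frequency cutoff is left out. The fixed window shown for contrast is one plausible reading: take 5 tokens on each side first, then drop the stop words.

```python
import re

# Hypothetical stop list, for illustration only; the slides do not give the real one.
STOP_WORDS = {"if", "she", "is", "to", "be", "a", "then", "they",
              "will", "with", "in", "her", "the", "of", "and"}


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_content(token):
    return token not in STOP_WORDS


def flexible_window(tokens, target_index, size=5):
    """Up to `size` content words on each side, skipping over stop words."""
    left = [t for t in tokens[:target_index] if is_content(t)][-size:]
    right = [t for t in tokens[target_index + 1:] if is_content(t)][:size]
    return left, right


def fixed_window(tokens, target_index, size=5):
    """Content words among the `size` tokens nearest the target on each side."""
    start = max(0, target_index - size)
    left = [t for t in tokens[start:target_index] if is_content(t)]
    right = [t for t in tokens[target_index + 1:target_index + 1 + size]
             if is_content(t)]
    return left, right


def bigrams(words):
    """Pairs of adjacent content words (assumed to be formed after stop-word removal)."""
    return list(zip(words, words[1:]))


note = ("If she is found to be a carrier, then they will follow with CF "
        "carrier testing in her husband.")
tokens = tokenize(note)
target = tokens.index("cf")

left, right = flexible_window(tokens, target)
unigram_features = left + right
bigram_features = bigrams(left) + bigrams(right)

print("flexible unigrams:", unigram_features)
print("flexible bigrams: ", bigram_features)
print("fixed window:     ", fixed_window(tokens, target))
```

With this toy stop list, the flexible window reaches back to "found" and "carrier" on the left of CF, while the fixed window keeps only "follow" there, because most of the five nearest tokens are stop words. That difference in coverage is the kind of effect the fixed vs. flexible comparison in the talk is about.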