A Comparative Study of
Supervised Learning
as Applied to
Acronym Expansion in
Clinical Reports
Mahesh Joshi, Serguei Pakhomov,
Ted Pedersen, Christopher G. Chute
University of Minnesota, Duluth
Mayo College of Medicine, Rochester
AMIA-2006
Overview
• Acronyms are ambiguous
– in general, and in more specialized domains
• Acronyms can be disambiguated by expansion
– expansions act as senses or definitions
• Acronym expansion can be viewed as word
sense disambiguation
– supervised learning from annotated examples
• Features trump learning algorithms
– unigrams dominant
AMIA - Top Google Results
• American Medical Informatics Association
• Association of Moving Image Archivists
• Anglican Mission in America
• Asociación Mutual Israelita Argentina
RN in Wikipedia
• Registered Nurse
• Royal Navy
• Radio National
• Radio Nederland
• Richard Nixon
• Registered Identification Number
• Renovación Nacional
Acronym Ambiguity not just a
problem for General English…
• 33% of Acronyms in UMLS are ambiguous
– Liu et al. AMIA-2001
• 81% of Acronyms in MEDLINE abstracts
are ambiguous, with an average of 16
expansions
– Liu et al. AMIA-2002
We view AE as WSD
• AE
– sense 1: American Eagle
– sense 2: Arab Emirates
– sense 3: acronym expansion
• WSD
– sense 1: Washington School for the Deaf
– sense 2: web server director
– sense 3: word sense disambiguation
Methodology
• Identify 16 ambiguous acronyms
– 9 from Pakhomov et al. AMIA-2005
– 7 newly annotated for this study
• Manually annotate in clinical notes
– 7,738 total instances from Mayo Clinic
database of clinical notes
• Use as training data for supervised learning
Acronyms (majority < 50%)
• AC
– Acromioclavicular
– Antitussive with Codeine
– Acid Controller
– 10 more expansions
• APC
– Argon Plasma Coagulation
– Adenomatous Polyposis Coli
– Atrial Premature Contraction
– 10 more expansions
• LE
– Limited Exam
– Lower Extremity
– Initials
– 5 more expansions
• PE
– Pulmonary Embolism
– Pressure Equalizing
– Patient Education
– 12 more expansions
Acronyms (50% < majority < 80%)
• CP
– Chest Pain
– Cerebral Palsy
– Cerebellopontine
– 19 more expansions
• HD
– Huntington's Disease
– Hemodialysis
– Hospital Day
– 9 more expansions
• CF
– Cystic Fibrosis
– Cold Formula
– Complement Fixation
– 6 more expansions
• MCI
– Mild Cognitive Impairment
– Methylchloroisothiazolinone
– Microwave Communications, Inc.
– 5 more expansions
• ID
– Infectious Disease
– Identification
– Idaho Identified
– 4 more expansions
• LA
– Long Acting
– Person
– Left Atrium
– 5 more expansions
Acronyms (majority > 80%)
• MI
– Myocardial Infarction
– Michigan
– Unknown
– 2 more expansions
• ACA
– Adenocarcinoma
– Anterior Cerebral Artery
– Anterior Communication Artery
– 3 more expansions
• GE
– Gastroesophageal
– General Exam
– Generose
– General Electric
• HA
– Headache
– Hearing Aid
– Hydroxyapatite
– 2 more expansions
• FEN
– Fluids, Electrolytes and Nutrition
– Drug Fen Phen
– Unknown
• NSR
– Normal Sinus Rhythm
– Nasoseptal Reconstruction
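The three acronym groups on the preceding slides are defined by how large a share of the annotated instances carries the most frequent (majority) expansion. A minimal Python sketch of that grouping follows. The function names and the instance counts are hypothetical, and the handling of shares of exactly 50% or 80% is an assumption, since the slides only give strict inequalities.

```python
from collections import Counter


def majority_share(labels):
    """Return (majority_expansion, share) for a list of annotated expansions."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


def majority_group(share):
    """Bucket an acronym by the share of its majority expansion.

    Assumption: a share of exactly 0.5 or 0.8 falls into the higher bucket;
    the slides do not say how such ties were handled.
    """
    if share < 0.5:
        return "majority < 50%"
    if share < 0.8:
        return "50% < majority < 80%"
    return "majority > 80%"


# Invented counts for illustration only; the real per-acronym counts
# are not given on the slides.
labels = ["Myocardial Infarction"] * 9 + ["Michigan"]
label, share = majority_share(labels)
print(label, share, majority_group(share))
```

The majority share is also the accuracy of a baseline that always predicts the most frequent expansion, which is how a "Majority" baseline is usually computed.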
Experimental Objectives
• Compare performance of ML methods
– Naïve Bayesian classifier
– J48/C4.5 Decision Tree Learner
– Support Vector Machine (SMO)
• Compare four different feature sets
– POS tags from Brill-Hepple Tagger
– Unigrams that occur 5 or more times
• flexible window of size 5 around target
– Bigrams that occur 5 or more times
• flexible window of size 5 around target
– Unigrams + Bigrams + POS Tags
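To make the setup concrete, here is a small pure-Python stand-in for one cell of this comparison: a Naïve Bayes classifier over unigram features with a frequency cutoff. It is a sketch only, not the configuration used in the study. The function names, the add-one smoothing, and the toy context words are assumptions; only the expansions of CF and the "5 or more times" cutoff come from the slides.

```python
import math
from collections import Counter, defaultdict


def build_vocabulary(instances, min_count=5):
    """Keep features that occur at least min_count times in the training data."""
    counts = Counter(f for features, _ in instances for f in features)
    return {f for f, c in counts.items() if c >= min_count}


def train_naive_bayes(instances, vocabulary):
    """Count instances per expansion and in-vocabulary features per expansion."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    for features, label in instances:
        label_counts[label] += 1
        for f in features:
            if f in vocabulary:
                feature_counts[label][f] += 1
    return label_counts, feature_counts


def predict(features, label_counts, feature_counts, vocabulary):
    """Return the expansion with the highest add-one-smoothed log posterior."""
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            if f in vocabulary:
                score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data: (context words, annotated expansion of CF).
train = [
    (["carrier", "testing", "gene"], "Cystic Fibrosis"),
    (["carrier", "gene", "mutation"], "Cystic Fibrosis"),
    (["cough", "cold", "tablet"], "Cold Formula"),
    (["cold", "tablet", "dose"], "Cold Formula"),
]
# The slides use a cutoff of 5; the toy data is tiny, so 2 is used here.
vocabulary = build_vocabulary(train, min_count=2)
label_counts, feature_counts = train_naive_bayes(train, vocabulary)
print(predict(["carrier", "husband"], label_counts, feature_counts, vocabulary))
```

Swapping in bigram or POS features only changes what goes into each feature list; the classifier code stays the same, which is what makes a features-versus-algorithms comparison straightforward to run.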
Feature Extraction
• Horizon: up to 5 content words to left and right of target
• Boundaries: cross sentences, but not clinical notes
• Skip stop words
• Bigrams are pairs of contiguous content words
• Example (CF is target):
– Unigrams: “If she is found to be a carrier, then they will follow with CF carrier testing in her husband.”
– Bigrams: “If she is found to be a carrier, then they will follow with CF carrier testing in her husband.”
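The extraction rules above can be sketched in a few lines of Python, applied to the CF example sentence. This is an illustration under stated assumptions rather than the study's pipeline: the stop list is invented, `tokenize` and `extract_features` are hypothetical names, and bigrams are formed from content words that are adjacent once stop words are dropped, separately on each side of the target.

```python
import re

# Illustrative stop list; the list used in the study is not given on the slides.
STOP_WORDS = {"a", "be", "her", "if", "in", "is", "she",
              "then", "they", "to", "will", "with"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_features(tokens, target_index, horizon=5):
    """Return (unigrams, bigrams) from up to `horizon` content words per side.

    `tokens` should cover one whole clinical note, so the window may cross
    sentence boundaries but never reaches into another note.
    """
    before = [t for t in tokens[:target_index] if t not in STOP_WORDS]
    after = [t for t in tokens[target_index + 1:] if t not in STOP_WORDS]
    left = before[max(len(before) - horizon, 0):]   # nearest content words on the left
    right = after[:horizon]                         # nearest content words on the right
    unigrams = left + right
    # Assumption: bigrams pair neighbouring content words within each side.
    bigrams = list(zip(left, left[1:])) + list(zip(right, right[1:]))
    return unigrams, bigrams


note = ("If she is found to be a carrier, then they will follow "
        "with CF carrier testing in her husband.")
tokens = tokenize(note)
unigrams, bigrams = extract_features(tokens, tokens.index("cf"))
print(unigrams)
print(bigrams)
```

With a different stop list the selected words would change, so the output here should not be read as the features the slide originally highlighted.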
Results (majority < 50%)
[Figure: Feature Comparison (AC, APC, LE, PE). Y-axis: Accuracy (%), 30 to 100. X-axis: Classifier (Decision Trees, Naïve Bayes, SVM). Series: POS, bigrams, unigrams, ALL, Majority.]
Results (50% < majority < 80%)
[Figure: Feature Comparison (CP, HD, CF, MCI, ID, LA). Y-axis: Accuracy (%), 30 to 100. X-axis: Classifier (Decision Trees, Naïve Bayes, SVM). Series: POS, bigrams, unigrams, ALL, Majority.]
Results (majority > 80%)
[Figure: Feature Comparison (MI, ACA, GE, HA, FEN, NSR). Y-axis: Accuracy (%), 30 to 100. X-axis: Classifier (Decision Trees, Naïve Bayes, SVM). Series: POS, bigrams, unigrams, ALL, Majority.]
Results (flexible window)
[Figure: Fixed vs. Flexible Window Performance. Y-axis: Accuracy (%), 70 to 95. X-axis: Window Size, 1 to 10. Series: fixed-unigrams, flexi-unigrams, fixed-bigrams, flexi-bigrams, fixed-unigrams+bigrams, flexi-unigrams+bigrams.]
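The slides do not spell out how the two window types differ, so the sketch below shows one plausible reading for unigram features: a fixed window inspects exactly n token positions on each side of the target and keeps whatever selected features fall inside, while a flexible window keeps scanning outward until it has found n selected features on each side. The function names, the vocabulary, and this interpretation are all assumptions.

```python
def fixed_window(tokens, target_index, size, vocabulary):
    """Inspect exactly `size` tokens per side; keep those in the vocabulary."""
    start = max(target_index - size, 0)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return [t for t in left + right if t in vocabulary]


def flexible_window(tokens, target_index, size, vocabulary):
    """Scan outward on each side until `size` vocabulary features are found."""
    left = [t for t in tokens[:target_index] if t in vocabulary]
    right = [t for t in tokens[target_index + 1:] if t in vocabulary]
    return left[max(len(left) - size, 0):] + right[:size]


tokens = ("if she is found to be a carrier then they will follow "
          "with cf carrier testing in her husband").split()
target = tokens.index("cf")
# Hypothetical set of features that passed the frequency cutoff.
vocabulary = {"found", "carrier", "follow", "testing", "husband"}

print(fixed_window(tokens, target, 2, vocabulary))
print(flexible_window(tokens, target, 2, vocabulary))
```

Under this reading the fixed window loses a slot to the non-feature word "with", while the flexible window reaches past it to "carrier".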
Conclusions
• Overall expansion accuracy at or above
90% regardless of distribution
• Differences in accuracy are largely due to
features, not ML algorithms
• Addition of bigrams and POS tags helps
performance, but unigrams dominant
• Flexible window improves upon fixed
window feature selection
Future Work
• Expand all acronyms in a text, not just
select few
– expand based on prior expansions
– utilize one sense per discourse constraint
• Integrate supervised methods with
knowledge based approaches and
clustering methods to reduce need for
annotated examples
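One way the one sense per discourse constraint could be applied is as a post-processing step: within a single clinical note, every occurrence of an acronym is relabeled with the expansion the classifier predicted most often for that acronym in that note. The sketch below is an illustration of that idea, not a method described on the slides; the function name and the example predictions are hypothetical.

```python
from collections import Counter


def one_sense_per_discourse(predictions):
    """Relabel each acronym with its most frequent predicted expansion.

    `predictions` is a list of (acronym, expansion) pairs from one note,
    in document order. Ties go to the expansion seen first, because
    Counter.most_common keeps first-encountered order among equal counts.
    """
    votes = {}
    for acronym, expansion in predictions:
        votes.setdefault(acronym, Counter())[expansion] += 1
    winners = {a: c.most_common(1)[0][0] for a, c in votes.items()}
    return [(a, winners[a]) for a, _ in predictions]


# Hypothetical per-instance predictions for one note.
note_predictions = [
    ("PE", "Pulmonary Embolism"),
    ("PE", "Patient Education"),
    ("PE", "Pulmonary Embolism"),
    ("HA", "Headache"),
]
print(one_sense_per_discourse(note_predictions))
```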
Acknowledgments
• We would like to thank our annotators Barbara
Abbott, Debra Albrecht and Pauline Funk.
• This work was supported in part by the NLM
Training Grant (T15 LM07041-19) and the NIH
Roadmap Multidisciplinary Clinical Research
Career Development Award (K12/NICHD)HD49078.
• Dr. Pedersen has been partially supported by a
National Science Foundation Faculty Early
CAREER Development Award (#0092784).
Software Resources
• GATE (General Architecture for Text Engineering)
– http://gate.ac.uk/
• NSPGate
– http://nspgate.sourceforge.net/
• Ngram Statistics Package
– http://ngram.sourceforge.net/
• WSDGate
– http://wsdgate.sourceforge.net/
• WEKA (Waikato Environment for Knowledge Analysis)
– http://www.cs.waikato.ac.nz/ml/weka/