2013-hmp668-wk08-hanauer

Author(s): David Hanauer
License: Unless otherwise noted, this material is made available under the terms of
the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License:
http://creativecommons.org/licenses/by-nc-sa/3.0/
We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use,
share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this
material.
Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions,
corrections, or clarification regarding the use of content.
For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.
Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a
replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your
physician if you have questions about your medical condition.
Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.
Citation Key
for more information see: http://open.umich.edu/wiki/CitationPolicy
Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License
Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in
your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in
your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee
that your use of the content is Fair.
To use this content you should do your own independent analysis to determine whether or not your use will be Fair.
Information Retrieval and
Natural Language Processing
David Hanauer
October 22, 2013
Disclosure
I have no conflicts of interest and no disclosures
to report
What is NLP?
The process of converting unstructured
language into a computable, structured form
so that a deep understanding of the meaning
embedded in the text can be extracted.
What is NLP?
NLP is more than just identifying words and
phrases. It deals (or should deal) with
ambiguity, negation, co-references, etc.
NLP Initial steps
• NLP initially involves breaking down text into
sentences, phrases, parts of speech, and actual
words. Even this can be complex!
Dr. Jones wanted to take a ride
along Rodeo Dr. Jones was 12.5
miles away. He was driving to the
rodeo and drinking an orange
Jones soda and looking at the
orange sunset.
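A minimal sketch of why even sentence splitting is hard, using the passage above. The rule here (break after any period followed by whitespace and a capital letter) and the name `naive_split` are illustrative assumptions, not a method from any particular NLP toolkit.

```python
import re

TEXT = (
    "Dr. Jones wanted to take a ride along Rodeo Dr. Jones was 12.5 "
    "miles away. He was driving to the rodeo and drinking an orange "
    "Jones soda and looking at the orange sunset."
)

def naive_split(text):
    """Split after a period that is followed by whitespace and a capital letter."""
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

sentences = naive_split(TEXT)
for s in sentences:
    print(repr(s))
```

The rule handles "12.5" and the street abbreviation "Rodeo Dr." (which really does end a sentence), but it also breaks after the title "Dr." at the start, so the same two characters need different treatment depending on context.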
Language can be complex
• Syntax (structure) and Semantics (meaning)
are both important
Time flies like an arrow
Mead C, 2006
Language can be complex
• Syntax (structure) and Semantics (meaning)
are both important
Time flies like an arrow
Fruit flies like a banana
Mead C, 2006
NLP Initial steps
• “The next United States presidential election
is to be held on Tuesday, November 6, 2012.”
http://tomato.banatao.berkeley.edu:8080/parser/parser.html
Mapping to concepts is often required
Processing 00000000.tx.1: lung cancer.
Phrase: "lung cancer." Meta Candidates (8):
1000 Lung Cancer (Malignant neoplasm of lung) [Neoplastic Process]
1000 Lung Cancer (Carcinoma of lung) [Neoplastic Process]
861 Cancer (Malignant Neoplasms) [Neoplastic Process]
861 Lung [Body Part, Organ, or Organ Component]
861 Cancer (Cancer Genus) [Invertebrate]
861 Lung (Entire lung) [Body Part, Organ, or Organ Component]
861 Cancer (Specialty Type - cancer) [Biomedical Occupation or Discipline]
768 Pneumonia [Disease or Syndrome]
Meta Mapping (1000):
1000 Lung Cancer (Carcinoma of lung) [Neoplastic Process]
Meta Mapping (1000):
1000 Lung Cancer (Malignant neoplasm of lung) [Neoplastic Process]
Machine learning is often used
• Hand-annotated examples are used to “train” a
system so it can “learn” from the examples.
• Patterns can then be detected in new
examples and a probability of meaning can be
assigned.
• Involves discerning between various
possibilities.
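The bullets above can be sketched with a tiny Naive Bayes text classifier written from scratch. The training sentences, the labels "symptomatic" and "well", and the function names are all hypothetical; a real system would use far more annotated data and richer features.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-annotated examples: (text, label).
TRAIN = [
    ("patient has fever and cough", "symptomatic"),
    ("complains of ear pain and fever", "symptomatic"),
    ("no fever no cough feels well", "well"),
    ("doing well no complaints", "well"),
]

def train(examples):
    """Count words per label; these counts are what the system 'learns'."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict_proba(text, model):
    """Return a probability for each label using add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    log_scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Convert log scores to probabilities that sum to 1.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}

model = train(TRAIN)
probs = predict_proba("fever and ear pain", model)
print(max(probs, key=probs.get), probs)
```

The output is a probability per label rather than a hard answer, which matches the idea of assigning "a probability of meaning" and then discerning between possibilities.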
Machine learning
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html
Why is this so complex?
• Natural language does follow some ‘rules’ but
it can be very free form.
• There are many ways to express the same or
similar concepts
Why is this so complex?
• “Forrest illustrated the difficulties with respect
to standards by citing a one-day record from
Children’s Hospital in Philadelphia in which
clinicians entering data into electronic medical
records (EMRs) used 278 ways to describe
fever for 465 patients, 123 different ways to
express ear pain in 213 patients, and 99
different ways to describe red ears.”
Rubinstein YR, Contemp Clin Trials. 2010 September; 31(5): 394–404
Many clinical notes are dictated…
…and this creates problems
Transcription Errors
• What was dictated:
– “He has no nasal flaring”
• What was transcribed:
– “He has no nasopharynx”
Transcription Errors
• What was dictated:
– “given a prescription for an albuterol MDI”
• What was transcribed:
– “given a prescription for an albumin MDI”
Transcription Errors
• What was dictated:
– “has had a runny nose since September”
• What was transcribed:
– “has had a funny nose since September”
Transcription Errors
Runny Nose
Funny Nose
Transcription Errors
• What was dictated:
– ????????
• What was transcribed:
– “Prior to discharge from the emergency
department she was offered some Motrin,
however she deferred atrial tachycardia.”
Medical terms are difficult to spell
• ibuprofen
– ibuprofin
– Ibuprophen
– ibuprophin
Medical terms are difficult to spell
• diarrhea
– diarrheae
– dirreahea
– diarheea
– diahhrea
– diahrrea
– diarhhea
– diarreah
– diarehha
– diarrea
– diahrea
– diarhea
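One common way to tolerate misspellings like these is approximate string matching. The sketch below computes Levenshtein edit distance (insertions, deletions, substitutions) between each variant and the correct spelling; the function name and the idea of using a distance cutoff are assumptions for illustration, not a description of any specific clinical tool.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]

VARIANTS = [
    "diarrheae", "dirreahea", "diarheea", "diahhrea", "diahrrea", "diarhhea",
    "diarreah", "diarehha", "diarrea", "diahrea", "diarhea",
]

distances = {v: edit_distance(v, "diarrhea") for v in VARIANTS}
for variant, dist in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{variant:10s} {dist}")
```

Every variant on the slide is within a few edits of "diarrhea", while an unrelated term such as "diabetes" is further away, so a small distance cutoff can group the misspellings. Choosing the cutoff is a trade-off: too loose and distinct drug or disease names start to collide.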
Clinical notes aren’t always “natural” language
• Limited or no structure
• Multiple abbreviations, many non-standard
14 yo here with dad.
2 days frontal HA and fever. No ST, No cough, No RN, mild abd pain,
sl dizzy today.
Perfectly well when he has has motrin.
No illnesses in years
T 98.2
Wt 112
TM perfect. Throat nl Chest clear. Turbs pink with exudate.
Imp URI. Course reviewed
Non-standard abbreviation
Missing units (pounds or kg?)
Missing punctuation
Clinical notes aren’t always “natural” language
• Brief intake note from medical assistant
wheezing, coughing, running nose
Mortin - 10:30 AM
pain in troat had toni. removed
Clinical notes aren’t always “natural” language
This says, “pain in throat, had tonsils removed.”
Spelling error
Free text notes are complex
• Discordant information is often present
The patient is seen today in follow-up for: 1. Status-post
aortic valve repair. 2. Status-post single vessel coronary
artery bypass. 3. Hypertension. 4. Hyperlipidemia. 5.
Atrial fibrillation.
The patient underwent mitral valve repair surgery and one
vessel bypass six weeks ago. He is doing well. He is
active. He has no dyspnea or angina or palpitations.
He saw his cardiologist last week and everything is going
well. He does note some ongoing fatigue. He also has had
difficulty maintaining his weight.
Free text notes are complex
• Discordant information is often present
The Maternal Past Medical and Pregnancy History:
Healthy prior to the pregnancy.
Course was complicated uncomplicated. Normal fetal
anatomic survey. Medications during the pregnancy
included PNV.
Free text notes are complex
• Discordant information is often present
He is approved to play Sports without restrictions,
but no form filled out. School form was filled out.
Free text notes are complex
• Discordant information is often present
History of Present Illness: The patient is an 86-year-old
African- American female with severe Alzheimer's disease,
and a history of multiple falls, who presents to the
Emergency Department after a witnessed fall from her bed,
and thus she injured her left wrist.
...
Physical Exam: General: Elderly white female appearing
somewhat confused, in no apparent distress.
Temperature 97.9. Heart rate 77. Respiratory rate 18.
Blood pressure 106/53.
Free text notes are complex
• Discordant information can exist over time as
well
1997
She has no history of hypertension or hyperlipidemia.
1998
Cardiac risk factors include a history of tobacco abuse,
having quit in October, as well as a history of hypertension
and hypercholesterolemia…
2002
She denies hypertension, diabetes, or hypercholesterolemia.
2004
Weight was 191.8, blood pressure with the M.A. was 142/98
and I got 128/84 when I rechecked it, and pulse was 62.
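A rough sketch of how discordance over time might be surfaced automatically: label each mention of a concept as negated or affirmed, then compare across notes. This is loosely inspired by trigger-phrase approaches to negation but is not an implementation of any published algorithm; the trigger list and function name are assumptions, and the "scope" of a trigger is simply everything before the concept in the sentence.

```python
import re

# Assumed, deliberately tiny list of negation triggers.
NEGATION = re.compile(r"\b(no|denies|without)\b", re.IGNORECASE)

NOTES = {
    1997: "She has no history of hypertension or hyperlipidemia.",
    1998: ("Cardiac risk factors include a history of tobacco abuse, "
           "having quit in October, as well as a history of hypertension "
           "and hypercholesterolemia."),
    2002: "She denies hypertension, diabetes, or hypercholesterolemia.",
}

def mention_status(sentence, concept):
    """Return 'absent', 'negated', or 'affirmed' for one concept in one sentence."""
    pos = sentence.lower().find(concept.lower())
    if pos == -1:
        return "absent"
    preceding = sentence[:pos]
    return "negated" if NEGATION.search(preceding) else "affirmed"

statuses = {year: mention_status(text, "hypertension") for year, text in NOTES.items()}
discordant = len({s for s in statuses.values() if s != "absent"}) > 1

print(statuses)
print("discordant:", discordant)
```

Note what this cannot do: the 2004 note never uses the word "hypertension" at all, yet a reading of 142/98 is clinically relevant. Matching words and triggers is only the first step toward the "deep understanding" NLP aims for.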
Free text notes are complex
• Discordant information can exist over time as
well (even just a few days)
Family Medicine Doctor (June 10, 2007):
Social History: She has never smoked, gets occasional
exercise, and drinks occasional alcohol. No history of
street drug use.
Emergency Medicine Doctor (June 14, 2007):
SOCIAL HISTORY: The patient occasionally smokes.
Denies any alcohol or drug use.
Copy-paste is also a problem
DOCTOR A (Day 1):
ASSESSMENT: <Name> is a three week old newborn with a
history of hyperbilirubinemia, who presents today after an
apparent life threatening event.
DOCTOR B (Day 3):
Assessment: <Name> is a three week old newborn with a
history of hyperbilirubinemia, who presents today after an
apparent life threatening event.
DOCTOR B (Day 4):
Assessment: <Name> is a three week old newborn with a
history of hyperbilirubinemia, who presents today after an
apparent life threatening event.
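Copied text like this can be flagged by measuring overlap between notes. The sketch below compares sets of three-word sequences (shingles) using Jaccard similarity; the shingle size and function names are assumptions chosen for illustration.

```python
import re

def shingles(text, n=3):
    """Return the set of n-word sequences in the text, ignoring case and punctuation."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b, n=3):
    """Shared shingles divided by all shingles; 1.0 means identical wording."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

DAY1 = ("ASSESSMENT: <Name> is a three week old newborn with a history of "
        "hyperbilirubinemia, who presents today after an apparent life "
        "threatening event.")
DAY3 = ("Assessment: <Name> is a three week old newborn with a history of "
        "hyperbilirubinemia, who presents today after an apparent life "
        "threatening event.")
OTHER = "He saw his cardiologist last week and everything is going well."

print(jaccard(DAY1, DAY3))   # notes by different doctors, two days apart
print(jaccard(DAY1, OTHER))  # unrelated note
```

The Day 1 and Day 3 assessments are word-for-word identical once case is ignored, including the phrase "presents today", which can no longer be true on Day 3.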
Copy-paste is also a problem
DOCTOR C (Day 5):
Assessment: <Name> is a three-week old newborn with a
history of hyperbilirubinemia, who presents today after an
apparent life threatening event characterized by cyanosis,
extremity and hand stiffening, apnea, and unresponsiveness
for 45 seconds-1 minute.
Doctor C (Day 6):
ADMISSION HISTORY: <Name> is a three week-old newborn with
a history of hyperbilirubinemia, who presents today from
clinic after her mother witnessed an episode of cyanosis
and apnea yesterday evening. <Name>'s mother explains that
<Name> fell asleep after feeding at 5:30PM yesterday.
Inaccurate Documentation
“This project determined the prevalence of
inaccurate clinical documentation resulting from
the use of computer-based documentation
systems that allow carry-forward of prior
information. The study found a failure rate of 8%
in a random sample of all electronic notes and a
failure rate of 26% in a random sample of notes
generated by reusing prior clinical notes.”
Ambiguous wording
• “melanoma; however, the findings are
subthreshold”
• “I favor atypical combined nevus rather than
melanoma”
• “Unequivocal findings of melanoma are not
identified.”
Hedge phrases – Probabilities for “Probable”
Hedge phrases
• Even determining if something is a hedge
phrase can be challenging…
• He was here in May.
• He may have melanoma.
• He may call back at any time with questions.
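A hand-written rule for the three sentences above might look like the following. The rule set, the list of verbs in `HEDGE_FOLLOWERS`, and the category names are all assumptions made for this sketch; it is meant to show how brittle such rules are, not to recommend them.

```python
import re

# Assumed list of words that, after "may", suggest diagnostic uncertainty.
HEDGE_FOLLOWERS = {"have", "be", "represent", "indicate", "reflect"}

def classify_may(sentence):
    """Classify the first occurrence of 'may' as 'month', 'hedge', 'other', or 'absent'."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    for i, tok in enumerate(tokens):
        if tok.lower() != "may":
            continue
        if tok == "May" and i > 0:
            return "month"  # capitalized mid-sentence: treat as the month
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        return "hedge" if nxt in HEDGE_FOLLOWERS else "other"
    return "absent"

for s in ["He was here in May.",
          "He may have melanoma.",
          "He may call back at any time with questions."]:
    print(classify_may(s), "<-", s)
```

The rule depends on capitalization and on a fixed verb list, so it would misfire on all-caps notes, on sentence-initial "May", or on phrasing such as "may well turn out to be".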
Hedge phrases
Q: Should hedge phrases be taken into
consideration with NLP or IR tasks?
A: Probably
Language can be ambiguous
moped vs. mopped
[Image removed: copyright]
Multiple Synonyms
• Carcinoma, Cancer, Ca, Tumor, Neoplasm
• Zithromax, Z-Pak, Azithromycin, Zmax
• White Count, Leukocyte Count, WBC
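For retrieval tasks, synonym groups like these are often used to expand a query so that a search for one term also finds the others. The sketch below uses a small hand-built table taken from the slide; the table layout and the function name `expand_query` are assumptions, and real systems typically draw synonyms from a curated terminology rather than a hard-coded list.

```python
# Synonym groups from the slide, keyed by an arbitrary canonical label.
SYNONYM_GROUPS = {
    "cancer": ["carcinoma", "cancer", "ca", "tumor", "neoplasm"],
    "azithromycin": ["zithromax", "azithromycin", "zmax"],
    "white blood cell count": ["white count", "leukocyte count", "wbc"],
}

# Reverse lookup: any synonym -> its canonical label.
CANONICAL = {
    term: label
    for label, terms in SYNONYM_GROUPS.items()
    for term in terms
}

def expand_query(term):
    """Return every known synonym for a term, or just the term if it is unknown."""
    label = CANONICAL.get(term.lower())
    if label is None:
        return [term.lower()]
    return list(SYNONYM_GROUPS[label])

print(expand_query("WBC"))
print(expand_query("Zithromax"))
```

Expansion improves recall but can hurt precision: adding "ca" to a cancer search also retrieves calcium and California, which is exactly the abbreviation problem.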
Ambiguous Abbreviations
• CA
– Calcium
– Cancer
– California
• MI
– Myocardial Infarction
– Michigan
• T1
– TNM cancer staging
– Bone of thoracic spine
– MRI weighting
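One simple way to choose among senses is to look at the surrounding words. In the sketch below the sense inventory for CA and MI comes from the slide, but the cue words for each sense are invented for illustration, and counting overlapping words is only a stand-in for the statistical models used in practice.

```python
import re

# Senses from the slide; cue words are hypothetical.
SENSES = {
    "CA": {
        "calcium": {"serum", "level", "low", "high", "mg"},
        "cancer": {"breast", "lung", "metastatic", "stage"},
        "California": {"moved", "lives", "from", "state"},
    },
    "MI": {
        "myocardial infarction": {"chest", "troponin", "acute", "ekg", "elevation"},
        "Michigan": {"moved", "lives", "from", "state"},
    },
}

def disambiguate(abbrev, sentence):
    """Pick the sense whose cue words overlap most with the sentence, or None."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(cues & words) for sense, cues in SENSES[abbrev].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("CA", "Serum CA level was low"))
print(disambiguate("MI", "He moved here from MI last year"))
print(disambiguate("CA", "CA noted"))
```

When no cue word appears, the function returns `None` rather than guessing, which reflects that some abbreviations cannot be resolved from the sentence alone.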
Case-sensitivity
• Did all of the patients in clinic with ALL arrive
on time?
• The renal patient with a high BUN can eat the
bun but not the burger.
• He had FROM of his hips bilaterally and could
walk from the chair to the door.
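The sentences above work only because case is preserved. A minimal sketch, assuming a small hand-made acronym table (the expansions for BUN and FROM are the standard clinical ones, added here for illustration): match acronyms case-sensitively, then see what happens when the same text arrives in all capitals.

```python
import re

# Assumed acronym table for illustration.
ACRONYMS = {
    "ALL": "acute lymphoblastic leukemia",
    "BUN": "blood urea nitrogen",
    "FROM": "full range of motion",
}

def find_acronyms(text):
    """Return (acronym, expansion) for each all-caps token found in the table."""
    hits = []
    for match in re.finditer(r"\b[A-Z]{2,}\b", text):
        token = match.group(0)
        if token in ACRONYMS:
            hits.append((token, ACRONYMS[token]))
    return hits

sentence = "Did all of the patients in clinic with ALL arrive on time?"
print(find_acronyms(sentence))          # one hit: the leukemia sense
print(find_acronyms(sentence.upper()))  # two hits: "all" is now mistaken for ALL
```

Case-sensitive matching separates "ALL" from "all" in mixed-case text, but once a note is typed or stored in capitals that signal is gone and every ordinary word that doubles as an acronym becomes a false positive.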
Case-sensitivity
A
OR
IN
IT
HE
IS
ARE
AND
ALL
DOT
GAS
PAN
TEN
PAT
BUS
POEMS
RICE
TIPS
CHOP
SLAP
PANDAS
Case-sensitivity
A: artery
IN: inches
HE: helium
AND: axillary node dissection
IS: incentive spirometry
ARE: active resistance exercise
OR: operating room
IT: intrathecal
ALL: acute lymphoblastic leukemia
DOT: directly observed therapy
GAS: group A strep
PAT: paroxysmal atrial tachycardia
PAN: polyarteritis nodosa
TEN: toxic epidermal necrolysis
POEMS: polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy and skin
BUS: buspirone
RICE: rest, ice, compression, elevation
TIPS: transvenous intrahepatic portosystemic shunt
CHOP: cyclophosphamide, hydroxydoxorubicin, Oncovin, prednisone
SLAP: serum leucine aminopeptidase
PANDAS: pediatric autoimmune neuropsychiatric disorders associated with Streptococcus
Case-sensitivity
• DID ALL TEN PANDAS EAT TUNA FISH AND
RICE?
– DID: Dissociative identity disorder
– ALL: Acute lymphoblastic leukemia
– TEN: Toxic epidermal necrolysis
– PANDAS: Pediatric autoimmune neuropsychiatric disorder associated with streptococcus
– EAT: Ectopic atrial tachycardia
– TUNA: Transurethral needle ablation
– FISH: Fluorescence in-situ hybridization
– AND: Axillary node dissection
– RICE: Rest, Ice, Compression, Elevation
Case-sensitivity
According to her EHR she went to the OR in
OR or MI for an AND of her ER pos and
HER 2 pos breast ca and to the ER in CA
for her pos MI.
Case-sensitivity
ACCORDING TO HER EHR SHE WENT TO
THE OR IN OR OR MI FOR AN AND OF
HER ER POS AND HER 2 POS BREAST
CA AND TO THE ER IN CA FOR HER
POS MI.
Case-sensitivity
According to her EHR she went to the OR in
OR or MI for an AND of her ER pos and
HER 2 pos breast ca and to the ER in CA
for her pos MI.
EHR: Electronic Health Record
OR (first): Operating Room
OR (second): Oregon
MI (first): Michigan
AND: Axillary Node Dissection
ER pos: Estrogen receptor positive
HER 2 pos: human epidermal growth factor receptor 2 positive
ca: cancer
ER (second): Emergency Room
CA: California
pos MI: positive or possible Myocardial Infarction
De-Identification (NLP example)
Why do we need de-id?
– Reduce risk
– Increase HIPAA compliance
– A lot of people look at medical records who are
not clinicians and don’t need to know who the
patient is
– Maintain trust of our patients
18 HIPAA identifiers
• names
• geographic subdivisions smaller than a state
• dates (with exceptions)
• telephone numbers
• fax numbers
• electronic mail addresses
• Social Security numbers
• medical record numbers
• health plan beneficiary numbers
• account numbers
• certificate/license numbers
• vehicle identifiers, license plates, etc.
• device identifiers and serial numbers
• web URLs
• IP addresses
• fingerprints, voice prints
• full face photos and comparable images
• any unique identifying number, characteristic or code
Free text de-identification
John Doe is a 2 year-old boy diagnosed with a
malignant diffuse intrinsic pontine glioma in
February 2004 who has now completed
radiation therapy. Friends for Michael, a
foundation that the family independently
discovered, contacted me on Friday
(09/05/05) to verify patient's diagnosis.
Free text de-identification
His final path showed a 12.5 x 7 x 7cm
squamous cell carcinoma with 4/23 lymph
nodes involved. Bronchoscopy culture from
4/19 numerous Strep. Milleri and H.
parainfluenzae. In addition, respiratory culture
from 4/23 grew pansensitive pseudomonas
aeurginosa.
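A simple pattern-based scrubber shows why this note is a hard case. The sketch below replaces anything shaped like a date (M/D or M/D/Y) with a placeholder; the pattern and the `[DATE]` token are assumptions for illustration and not the behavior of any specific de-identification tool.

```python
import re

# Matches M/D or M/D/Y, e.g. 4/19 or 09/05/05.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")

NOTE = (
    "His final path showed a 12.5 x 7 x 7cm squamous cell carcinoma with "
    "4/23 lymph nodes involved. Bronchoscopy culture from 4/19 numerous "
    "Strep. Milleri and H. parainfluenzae. In addition, respiratory culture "
    "from 4/23 grew pansensitive pseudomonas aeurginosa."
)

def scrub_dates(text):
    """Replace every date-shaped token with a placeholder."""
    return DATE_PATTERN.sub("[DATE]", text)

matches = DATE_PATTERN.findall(NOTE)
scrubbed = scrub_dates(NOTE)
print(matches)
print(scrubbed)
```

The pattern finds three date-shaped tokens, but the first "4/23" is a lymph node count (4 of 23 nodes involved), not a date. Scrubbing it removes clinically important information, while the identical string later in the note is a real date that should be removed.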
How modern de-id systems work
Statistical model built using “features” of the
text thought to be important to discriminate
protected health information (PHI) from non-PHI.
Common features used in de-id
1. Morphological
– Capitalization
– Neighboring words
– Punctuation
2. Syntactic
– Parts of speech
3. Semantic
– Dictionary terms (names, cities, hospitals)
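The feature types listed above can be sketched as a function that turns one token into a dictionary of features, which a statistical model would then consume. The feature names and the tiny name dictionary are hypothetical, and part-of-speech features are omitted because they would require a tagger.

```python
# Hypothetical dictionary of known names; real systems use large lists
# of names, cities, and hospitals.
NAME_DICTIONARY = {"john", "doe", "michael"}

def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    tok = tokens[i]
    return {
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punctuation": any(not c.isalnum() for c in tok),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
        "in_name_dictionary": tok.lower().strip(".,()") in NAME_DICTIONARY,
    }

tokens = "John Doe is a 2 year-old boy".split()
features = token_features(tokens, 1)  # features for "Doe"
print(features)
```

No single feature decides whether a token is PHI: "Friends for Michael" contains a capitalized dictionary name that belongs to a foundation, so the model has to weigh the features together with the neighboring words.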
MITRE Identification Scrubber Toolkit (MIST)
Brief live demo (if possible)…
What else are people doing with NLP?
• Natural Language Processing to Identify Venous
Thromboembolic Events (Reichley RM 2007)
• Automated Identification of Postoperative
Complications Within an Electronic Medical Record
Using Natural Language Processing (Murff HJ 2011)
• Automatic Identification of Critical Follow-Up
Recommendation Sentences in Radiology Reports
(Yetisgen-Yildiz M 2011)
• Mayo Clinic NLP System for patient smoking status
identification (Savova GK 2008)
Will NLP still be important in 10 years?
Maybe we should just code everything initially…
Will NLP still be important in 10 years?
“[T]here has been an emphasis on deploying
computer-based documentation systems that
prioritize direct structured documentation.
Research has demonstrated that healthcare
providers value different factors when writing
clinical notes, such as narrative expressivity,
amenability to the existing workflow, and
usability.”
NLP is difficult to implement
Wendy Chapman iDASH talk, May 20, 2011
EMERSE
Electronic Medical Record Search Engine
• Not NLP (in the modern sense), but
usable by non-technical people
• Lets humans do the “processing”, based
on search engine technology to help
users find what they need
• Low barrier for adoption
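To illustrate the search-engine idea in general terms, here is a minimal inverted index over a few sentences from earlier slides. This is not EMERSE's implementation; the function names and the AND-only query semantics are assumptions made for the sketch.

```python
import re
from collections import defaultdict

DOCS = {
    1: "He has no dyspnea or angina or palpitations.",
    2: "She denies hypertension, diabetes, or hypercholesterolemia.",
    3: "She has no history of hypertension or hyperlipidemia.",
}

def build_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(DOCS)
print(search(index, "hypertension"))
print(search(index, "no hypertension"))
```

A search for "hypertension" returns both notes that mention it, even though both mentions are negated. The search engine finds the text; the human reader decides what it means.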
EMERSE
Brief live demo (if possible)…
Image Attributions
• “Banana” by nicubunu is in the Public Domain.
• “Wings” by dear_theophilus is in the Public Domain.
• “Ill” by William Brawley is under a Creative Commons license CC BY 2.0. https://www.flickr.com/photos/williambrawley/4195919691
• “Clown chili peppers” by Rick Dikeman is under a Creative Commons license CC BY-SA 3.0. https://commons.wikimedia.org/wiki/File:Clown_chili_peppers.jpg
• “Blue scooter” by pearish is in the Public Domain.
• “Frown” by Arvin61r58 is in the Public Domain.
• “Moe Biggie” by Pete Simon is under a Creative Commons license CC BY 2.0.
• “Panda” by Machovka is in the Public Domain.
• “Tekka maki sushi” by johnny_automatic is in the Public Domain.
• “My Wife and My Mother-in-Law” by William Ely Hill is in the Public Domain.