Author(s): David Hanauer

License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License: http://creativecommons.org/licenses/by-nc-sa/3.0/

We have reviewed this material in accordance with U.S. Copyright Law and have tried to maximize your ability to use, share, and adapt it. The citation key on the following slide provides information about how you may share and adapt this material.

Copyright holders of content included in this material should contact open.michigan@umich.edu with any questions, corrections, or clarification regarding the use of content.

For more information about how to cite these materials visit http://open.umich.edu/education/about/terms-of-use.

Any medical information in this material is intended to inform and educate and is not a tool for self-diagnosis or a replacement for medical evaluation, advice, diagnosis or treatment by a healthcare professional. Please speak to your physician if you have questions about your medical condition.

Viewer discretion is advised: Some medical content is graphic and may not be suitable for all viewers.

Citation Key
for more information see: http://open.umich.edu/wiki/CitationPolicy

Use + Share + Adapt
{ Content the copyright holder, author, or law permits you to use, share and adapt. }
Public Domain – Government: Works that are produced by the U.S. Government. (17 USC § 105)
Public Domain – Expired: Works that are no longer protected due to an expired copyright term.
Public Domain – Self Dedicated: Works that a copyright holder has dedicated to the public domain.
Creative Commons – Zero Waiver
Creative Commons – Attribution License
Creative Commons – Attribution Share Alike License
Creative Commons – Attribution Noncommercial License
Creative Commons – Attribution Noncommercial Share Alike License
GNU – Free Documentation License

Make Your Own Assessment
{ Content Open.Michigan believes can be used, shared, and adapted because it is ineligible for copyright. }
Public Domain – Ineligible: Works that are ineligible for copyright protection in the U.S. (17 USC § 102(b)) *laws in your jurisdiction may differ
{ Content Open.Michigan has used under a Fair Use determination. }
Fair Use: Use of works that is determined to be Fair consistent with the U.S. Copyright Act. (17 USC § 107) *laws in your jurisdiction may differ
Our determination DOES NOT mean that all uses of this 3rd-party content are Fair Uses and we DO NOT guarantee that your use of the content is Fair. To use this content you should do your own independent analysis to determine whether or not your use will be Fair.

Information Retrieval and Natural Language Processing
David Hanauer
October 22, 2013

Disclosure
I have no conflicts of interest and no disclosures to report.

What is NLP?
The process of converting unstructured language into a computable, structured form so that a deep understanding of the meaning embedded in the text can be extracted.
NLP is more than just identifying words and phrases. It deals (or should deal) with ambiguity, negation, co-references, etc.

NLP Initial steps
• NLP initially involves breaking down text into sentences, phrases, parts of speech, and actual words. Even this can be complex!
Dr. Jones wanted to take a ride along Rodeo Dr. Jones was 12.5 miles away. He was driving to the rodeo and drinking an orange Jones soda and looking at the orange sunset.
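To make the sentence-splitting problem concrete, here is a minimal sketch (not from the slides) that runs two simple splitters over the Dr. Jones example. The function names `naive_split` and `regex_split` are hypothetical, and both rules are deliberately simplistic. A plausible human reading of the text has three sentences; neither rule recovers them.

```python
import re

TEXT = ("Dr. Jones wanted to take a ride along Rodeo Dr. Jones was 12.5 miles away. "
        "He was driving to the rodeo and drinking an orange Jones soda "
        "and looking at the orange sunset.")

def naive_split(text):
    """Split on every period, the simplest possible sentence rule."""
    return [part.strip() for part in text.split(".") if part.strip()]

def regex_split(text):
    """Split only where a period is followed by whitespace and a capital letter."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+(?=[A-Z])", text) if s.strip()]

naive = naive_split(TEXT)    # breaks inside "Dr." and inside "12.5"
better = regex_split(TEXT)   # keeps "12.5" whole but still isolates the title "Dr."

for label, pieces in (("naive", naive), ("regex", better)):
    print(label, len(pieces))
    for piece in pieces:
        print("   ", repr(piece))
```

The second rule cannot tell "Dr." the title from "Dr." the street abbreviation that happens to end a sentence, which is the kind of ambiguity that pushes real systems toward abbreviation lists or trained models.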
Language can be complex
• Syntax (structure) and Semantics (meaning) are both important
Time flies like an arrow
Fruit flies like a banana
Mead C, 2006

NLP Initial steps
• “The next United States presidential election is to be held on Tuesday, November 6, 2012.”
http://tomato.banatao.berkeley.edu:8080/parser/parser.html

Mapping to concepts is often required
Processing 00000000.tx.1: lung cancer.
Phrase: "lung cancer."
Meta Candidates (8):
  1000 Lung Cancer (Malignant neoplasm
  1000 Lung Cancer (Carcinoma of lung) [Neoplastic Process]
  861 Cancer (Malignant Neoplasms) [Neoplastic Process]
  861 Lung [Body Part, Organ, or Organ Component
  861 Cancer (Cancer Genus) [Invertebrate]
  861 Lung (Entire lung) [Body Part, Organ, or Organ Component]
  861 Cancer (Specialty Type - cancer) [Biomedical Occupation or Discipline]
  768 Pneumonia [Disease or Syndrome]
Meta Mapping (1000):
  1000 Lung Cancer (Carcinoma of lung) [Neoplastic Process]
Meta Mapping (1000):
  1000 Lung Cancer (Malignant neoplasm of lung) [Neoplastic Process]

Machine learning is often used
• Hand annotated examples used to “train” a system so it can “learn” from the examples.
• Patterns can then be detected in new examples and a probability of meaning can be assigned.
• Involves discerning between various possibilities.

Machine learning
http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html

Why is this so complex?
• Natural language does follow some ‘rules’ but it can be very free form.
• There are many ways to express the same or similar concepts
• “Forrest illustrated the difficulties with respect to standards by citing a one-day record from Children’s Hospital in Philadelphia in which clinicians entering data into electronic medical records (EMRs) used 278 ways to describe fever for 465 patients, 123 different ways to express ear pain in 213 patients, and 99 different ways to describe red ears.” Rubinstein YR, Contemp Clin Trials. 2010 September; 31(5): 394–404

Many clinical notes are dictated…
…and this creates problems

Transcription Errors
• What was dictated:
  – “He has no nasal flaring”
• What was transcribed:
  – “He has no nasopharynx”

• What was dictated:
  – “given a prescription for an albuterol MDI”
• What was transcribed:
  – “given a prescription for an albumin MDI”

• What was dictated:
  – “has had a runny nose since September”
• What was transcribed:
  – “has had a funny nose since September”
[Images: “Runny Nose” and “Funny Nose”]

• What was dictated:
  – ????????
• What was transcribed:
  – “Prior to discharge from the emergency department she was offered some Motrin, however she deferred atrial tachycardia.”

Medical terms are difficult to spell
• ibuprofen
  – ibuprofin
  – Ibuprophen
  – ibuprophin
• diarrhea
  – diarrheae
  – dirreahea
  – diarheea
  – diahhrea
  – diahrrea
  – diarhhea
  – diarreah
  – diarehha
  – diarrea
  – diahrea
  – diarhea

Clinical notes aren’t always “natural” language
• Limited or no structure
• Multiple abbreviations, many non-standard
14 yo here with dad. 2 days frontal HA and fever. No ST, No cough, No RN, mild abd pain, sl dizzy today. Perfectly well when he has has motrin. No illnesses in years T 98.2 Wt 112 TM perfect. Throat nl Chest clear. Turbs pink with exudate. Imp URI. Course reviewed

Annotations on this note:
• Non-standard abbreviation
• Missing units (pounds or kg?)
• Missing punctuation
14 yo here with dad. 2 days frontal HA and fever. No ST, No cough, No RN, mild abd pain, sl dizzy today.
Perfectly well when he has has motrin. No illnesses in years T 98.2 Wt 112 TM perfect. Throat nl Chest clear. Turbs pink with exudate. Imp URI. Course reviewed

Clinical notes aren’t always “natural” language
• Brief intake note from medical assistant
wheezing, coughing, running nose Mortin - 10:30 AM pain in troat had toni. removed

Annotations on this note:
• This says, “pain in throat, had tonsils removed.”
• Spelling error

Free text notes are complex
• Discordant information is often present

The patient is seen today in follow-up for:
1. Status-post aortic valve repair.
2. Status-post single vessel coronary artery bypass.
3. Hypertension.
4. Hyperlipidemia.
5. Atrial fibrillation.
The patient underwent mitral valve repair surgery and one vessel bypass six weeks ago. He is doing well. He is active. He has no dyspnea or angina or palpitations. He saw his cardiologist last week and everything is going well. He does note some ongoing fatigue. He also has had difficulty maintaining his weight.

The Maternal Past Medical and Pregnancy History: Healthy prior to the pregnancy. Course was complicated uncomplicated. Normal fetal anatomic survey. Medications during the pregnancy included PNV.

He is approved to play Sports without restrictions, but no form filled out. School form was filled out.

History of Present Illness: The patient is an 86-year-old African-American female with severe Alzheimer's disease, and a history of multiple falls, who presents to the Emergency Department after a witnessed fall from her bed, and thus she injured her left wrist.
...
Physical Exam: General: Elderly white female appearing somewhat confused, in no apparent distress.
Temperature 97.9. Heart rate 77. Respiratory rate 18. Blood pressure 106/53.

Free text notes are complex
• Discordant information can exist over time as well
1997 She has no history of hypertension or hyperlipidemia.
1998 Cardiac risk factors include a history of tobacco abuse, having quit in October, as well as a history of hypertension and hypercholesterolemia…
2002 She denies hypertension, diabetes, or hypercholesterolemia.
2004 Weight was 191.8, blood pressure with the M.A. was 142/98 and I got 128/84 when I rechecked it, and pulse was 62.

Free text notes are complex
• Discordant information can exist over time as well (even just a few days)
Family Medicine Doctor (June 10, 2007): Social History: She has never smoked, gets occasional exercise, and drinks occasional alcohol. No history of street drug use.
Emergency Medicine Doctor (June 14, 2007): SOCIAL HISTORY: The patient occasionally smokes. Denies any alcohol or drug use.

Copy-paste is also a problem
DOCTOR A (Day 1): ASSESSMENT: <Name> is a three week old newborn with a history of hyperbilirubinemia, who presents today after an apparent life threatening event.
DOCTOR B (Day 3): Assessment: <Name> is a three week old newborn with a history of hyperbilirubinemia, who presents today after an apparent life threatening event.
DOCTOR B (Day 4): Assessment: <Name> is a three week old newborn with a history of hyperbilirubinemia, who presents today after an apparent life threatening event.
DOCTOR C (Day 5): Assessment: <Name> is a three-week old newborn with a history of hyperbilirubinemia, who presents today after an apparent life threatening event characterized by cyanosis, extremity and hand stiffening, apnea, and unresponsiveness for 45 seconds-1 minute.
Doctor C (Day 6): ADMISSION HISTORY: <Name> is a three week-old newborn with a history of hyperbilirubinemia, who presents today from clinic after her mother witnessed an episode of cyanosis and apnea yesterday evening.
<Name>'s mother explains that <Name> fell asleep after feeding at 5:30PM yesterday.

Inaccurate Documentation
“This project determined the prevalence of inaccurate clinical documentation resulting from the use of computer-based documentation systems that allow carry-forward of prior information. The study found a failure rate of 8% in a random sample of all electronic notes and a failure rate of 26% in a random sample of notes generated by reusing prior clinical notes.”

Ambiguous wording
• “melanoma; however, the findings are subthreshold”
• “I favor atypical combined nevus rather than melanoma”
• “Unequivocal findings of melanoma are not identified.”

Hedge phrases – Probabilities for “Probable”

Hedge phrases
• Even determining if something is a hedge phrase can be challenging…
• He was here in May.
• He may have melanoma.
• He may call back at any time with questions.

Hedge phrases
Q: Should hedge phrases be taken into consideration with NLP or IR tasks?
A: Probably

Language can be ambiguous
moped
mopped
Image removed copyright

Multiple Synonyms
• Carcinoma, Cancer, Ca, Tumor, Neoplasm
• Zithromax, Z-Pak, Azithromycin, Zmax
• White Count, Leukocyte Count, WBC

Ambiguous Abbreviations
• CA
  – Calcium
  – Cancer
  – California
• MI
  – Myocardial Infarction
  – Michigan
• T1
  – TNM cancer staging
  – Bone of thoracic spine
  – MRI weighting

Case-sensitivity
• Did all of the patients in clinic with ALL arrive on time?
• The renal patient with a high BUN can eat the bun but not the burger.
• He had FROM of his hips bilaterally and could walk from the chair to the door.
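The three sentences above can be used to sketch why letter case matters when looking up abbreviations. This is an illustrative example, not part of the original slides: the function `find_abbreviations` and the small dictionary are hypothetical. The expansion for ALL appears elsewhere in the slides; those for BUN and FROM are the standard clinical meanings, supplied here as assumptions.

```python
import re

# Abbreviations taken from the slide. The BUN and FROM expansions are the
# standard clinical meanings, added here for illustration only.
ABBREVIATIONS = {
    "ALL": "acute lymphoblastic leukemia",
    "BUN": "blood urea nitrogen",
    "FROM": "full range of motion",
}

SENTENCES = [
    "Did all of the patients in clinic with ALL arrive on time?",
    "The renal patient with a high BUN can eat the bun but not the burger.",
    "He had FROM of his hips bilaterally and could walk from the chair to the door.",
]

def find_abbreviations(text, case_sensitive=True):
    """Return (token, expansion) pairs for every token that matches a known abbreviation."""
    hits = []
    for token in re.findall(r"[A-Za-z]+", text):
        key = token if case_sensitive else token.upper()
        if key in ABBREVIATIONS:
            hits.append((token, ABBREVIATIONS[key]))
    return hits

for sentence in SENTENCES:
    strict = find_abbreviations(sentence, case_sensitive=True)
    loose = find_abbreviations(sentence, case_sensitive=False)
    print(f"{len(strict)} case-sensitive vs {len(loose)} case-insensitive: {sentence}")
```

Case-sensitive matching finds one abbreviation per sentence, while case-insensitive matching also flags the ordinary words "all", "bun", and "from". The strict rule only helps when the note preserves case; text typed entirely in capitals removes that signal.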
Case-sensitivity
A OR IN IT HE IS ARE AND ALL DOT GAS PAN TEN PAT BUS POEMS RICE TIPS CHOP SLAP PANDAS

Case-sensitivity
• A: artery
• IN: inches
• HE: Helium
• AND: Axillary node dissection
• IS: Incentive spirometry
• ARE: active resistance exercise
• OR: Operating room
• IT: intrathecal
• ALL: acute lymphoblastic leukemia
• DOT: directly observed therapy
• GAS: group A strep
• PAT: paroxysmal atrial tachycardia
• PAN: polyarteritis nodosa
• TEN: toxic epidermal necrolysis
• POEMS: polyneuropathy, organomegaly, endocrinopathy, monoclonal gammopathy and skin
• BUS: buspirone
• RICE: rest, ice, compression, elevation
• TIPS: transvenous Intrahepatic Portosystemic Shunt
• CHOP: cyclophosphamide, hydroxydoxorubicin, Oncovin, prednisone
• SLAP: serum leucine aminopeptidase
• PANDAS: pediatric autoimmune neuropsychiatric disorders associated with Streptococcus

Case-sensitivity
• DID ALL TEN PANDAS EAT TUNA FISH AND RICE?
  – Dissociative identity disorder
  – Acute lymphoblastic leukemia
  – Toxic epidermal necrolysis
  – Pediatric autoimmune neuropsychiatric disorder associated with streptococcus
  – Ectopic atrial tachycardia
  – Transurethral needle ablation
  – Fluorescence in-situ hybridization
  – Axillary node dissection
  – Rest, Ice, Compression, Elevation

Case-sensitivity
According to her EHR she went to the OR in OR or MI for an AND of her ER pos and HER 2 pos breast ca and to the ER in CA for her pos MI.

Case-sensitivity
ACCORDING TO HER EHR SHE WENT TO THE OR IN OR OR MI FOR AN AND OF HER ER POS AND HER 2 POS BREAST CA AND TO THE ER IN CA FOR HER POS MI.

Case-sensitivity
Annotations on the sentence above, in reading order:
• EHR: Electronic Health Record
• OR (first): Operating Room
• OR (second): Oregon
• MI (first): Michigan
• AND: Axillary Node Dissection
• ER pos: Estrogen receptor positive
• HER 2 pos: human epidermal Growth factor receptor 2 positive
• ca: cancer
• ER (second): Emergency Room
• CA: California
• pos MI: possible Myocardial Infarction

De-Identification (NLP example)

Why do we need de-id?
– Reduce risk
– Increase HIPAA compliance
– A lot of people look at medical records who are not clinicians and don’t need to know who the patient is
– Maintain trust of our patients

18 HIPAA identifiers
• names
• geographic subdivisions smaller than a state
• dates (with exceptions)
• telephone numbers
• FAX numbers
• electronic mail addresses
• Social Security numbers
• medical record numbers
• health plan beneficiary numbers
• account numbers
• certificate/license #s
• vehicle identifiers, license plates, etc.
• device identifiers and serial numbers
• web URLs
• IP addresses
• Finger prints, voice prints
• full face photos and comparable images
• any unique identifying number, characteristic or code

Free text de-identification
John Doe is a 2 year-old boy diagnosed with a malignant diffuse intrinsic pontine glioma in February 2004 who has now completed radiation therapy. Friends for Michael, a foundation that the family independently discovered, contacted me on Friday (09/05/05) to verify patient's diagnosis.

Free text de-identification
His final path showed a 12.5 x 7 x 7cm squamous cell carcinoma with 4/23 lymph nodes involved. Bronchoscopy culture from 4/19 numerous Strep. Milleri and H. parainfluenzae. In addition, respiratory culture from 4/23 grew pansensitive pseudomonas aeurginosa.
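The pathology example above contains "4/23" twice, once as a lymph node count and once as a date. The following sketch, which is an illustration and not the method of any particular de-id tool, shows that a pattern alone cannot separate the two and that a neighboring word can. The regular expression, the function names, and the cue words "from" and "on" are all assumptions made for this example.

```python
import re

NOTE = ("His final path showed a 12.5 x 7 x 7cm squamous cell carcinoma with "
        "4/23 lymph nodes involved. Bronchoscopy culture from 4/19 numerous "
        "Strep. Milleri and H. parainfluenzae. In addition, respiratory culture "
        "from 4/23 grew pansensitive pseudomonas aeurginosa.")

# Hypothetical pattern: one or two digits, a slash, one or two digits (e.g. 4/23).
DATE_LIKE = re.compile(r"\b\d{1,2}/\d{1,2}\b")

def date_candidates(text):
    """Return (matched text, preceding word, following word) for each date-like token."""
    results = []
    for match in DATE_LIKE.finditer(text):
        before = text[:match.start()].split()
        after = text[match.end():].split()
        prev_word = before[-1] if before else ""
        next_word = after[0] if after else ""
        results.append((match.group(), prev_word, next_word))
    return results

def looks_like_date(prev_word):
    """Toy neighboring-word rule: treat the token as a date only after 'from' or 'on'."""
    return prev_word.lower() in {"from", "on"}

for token, prev_word, next_word in date_candidates(NOTE):
    label = "date (PHI)" if looks_like_date(prev_word) else "not a date"
    print(f"{token!r} after {prev_word!r}, before {next_word!r}: {label}")
```

The pattern finds three candidates, and only the preceding word separates the lymph node ratio from the two culture dates. A hand-written cue list like this one is brittle; a statistical system would instead learn such cues from annotated examples.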
How modern de-id systems work
Statistical model built using “features” of the text thought to be important to discriminate protected health information (PHI) from non-PHI.

Common features used in de-id
1. Morphological
  – Capitalization
  – Neighboring words
  – Punctuation
2. Syntactic
  – Parts of speech
3. Semantic
  – Dictionary terms (names, cities, hospitals)

MITRE Identification Scrubber Toolkit (MIST)
Brief live demo (if possible)…

What else are people doing with NLP?
• Natural Language Processing to Identify Venous Thromboembolic Events (Reichley RM 2007)
• Automated Identification of Postoperative Complications Within an Electronic Medical Record Using Natural Language Processing (Murff HJ 2011)
• Automatic Identification of Critical Follow-Up Recommendation Sentences in Radiology Reports (Yetisgen-Yildiz M 2011)
• Mayo Clinic NLP System for patient smoking status identification (Savova GK 2008)

Will NLP still be important in 10 years?
Maybe we should just code everything initially…
“[T]here has been an emphasis on deploying computer-based documentation systems that prioritize direct structured documentation. Research has demonstrated that healthcare providers value different factors when writing clinical notes, such as narrative expressivity, amenability to the existing workflow, and usability.”

NLP is difficult to implement
Wendy Chapman iDASH talk, May 20, 2011

EMERSE
Electronic Medical Record Search Engine
• Not NLP (in the modern sense), but usable by non-technical people
• Lets humans do the “processing”, based on search engine technology to help users find what they need
• Low barrier for adoption

EMERSE
Brief live demo (if possible)…

Image Attributions
• “Banana” by nicubunu is in the Public Domain.
• “Wings” by dear_theophilus is in the Public Domain.
• “Ill” by William Brawley is under a Creative Commons license CC BY 2.0.
https://www.flickr.com/photos/williambrawley/4195919691
• “Clown chili peppers” by Rick Dikeman is under a Creative Commons license CC BY-SA 3.0. https://commons.wikimedia.org/wiki/File:Clown_chili_peppers.jpg
• “Blue scooter” by pearish is in the Public Domain.
• “Frown” by Arvin61r58 is in the Public Domain.
• “Moe Biggie” by Pete Simon is under a Creative Commons license CC BY 2.0.
• “Panda” by Machovka is in the Public Domain.
• “Tekka maki sushi” by johnny_automatic is in the Public Domain.
• “My Wife and My Mother-in-Law” by William Ely Hill is in the Public Domain.