Integrating Data for Analysis, Anonymization, and Sharing

An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain
Wendy W. Chapman, PhD
Division of Biomedical Informatics
University of California, San Diego

Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• iDASH
• Opportunities for sharing and collaboration in NLP

NLP Success
“Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer "Watson" has enrolled in medical school at Columbia University,” New York Daily News, February 18th 2011

Clinical NLP Since 1960’s
Why has clinical NLP had little impact on clinical care?

Barriers to Development
• Sharing clinical data difficult
  – Have not had shared datasets for development and evaluation
  – Modules trained on general English not sufficient
• Insufficient common conventions and standards for annotations
  – Data sets are unique to a lab
  – Not easily interchangeable
• Limited collaboration
  – Clinical NLP applications are silos and black boxes
  – Have not had open source applications
• Reproducibility is formidable
  – Open source release not always sufficient
  – Software engineering quality not always great
  – Mechanisms for reproducing results are sparse

Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• Developing an NLP ecosystem on iDASH

Security & Privacy Concerns
• Clinical texts have many patient identifiers
  – 18 HIPAA identifiers
    • Names
    • Addresses
• Items not regulated by HIPAA
  – tight end for the Steelers
• Unique cases
  – 50-year-old woman who is pregnant
• Sensitive information
  – HIV status
Institutions are reluctant to share data

Lack of user-centered development and scalability
  – Perceived cost of applying NLP outweighs the perceived benefit (Len D’Avolio)

Overview
• The promise of
natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• Developing an NLP ecosystem on iDASH

iDASH
• integrating Data
• Analysis
• Anonymization
• Sharing
Data, Software/Tools, Computational Resources

Disincentives to Share
• ‘Scooping’ by faster analysts
• Exposure of potential errors in data
• Resources for preparing data submissions
• Maintaining data
• Interacting with potential users takes time
• Threat of privacy breach when human subjects are involved
  – Do not have policies in place
  – Fallible de-identification, anonymization algorithms
iDASH aims to minimize these disincentives

nlp-ecosystem.ucsd.edu

HIPAA &/or FISMA Compliant Cloud
• Access control
• De-identification
• Query counts
• Artificial data generators
Diagram labels: Digital Informed consent; Privacy preserving; Informed Consent Registry; Customizable DUAs

NLP Ecosystem
Diagram labels: Schemas; Bibliography; Tutorials; Research Guidelines; Resources; Education; Data; De-identification; Tools & Services; TextVect; Virtual Machines; Registry; UCSD Clinical Data; Evaluation Workbench; MT Samples; Collaborative Development Tools; Annotation Admin & eHOST; 2011 summer internship program
funded by NIH U54HL108460

Collaborative Effort to Build Ecosystem
Tools & Services
• Increase access to NLP
  – De-identification
  – TextVect
  – Virtual Machines
  – Registry
• Decrease Burden of Developing NLP
  – Evaluation Workbench
  – Collaborative Knowledge Authoring
  – Annotation Environment

Increase ability to find NLP tools
orbit Registry: orbit.nlm.nih.gov
Len D’Avolio, Dina Demner-Fushman

Increase access to clinical text
De-identification service

De-identification
• Several available de-identification modules
• Need to adapt to local text
  – Efficient
  – Secure
• Customizable ensemble de-identification system
  – Build a de-identified corpus
  – Incorporate existing de-id modules
  – Launch as virtual machine
  – Iterative training, evaluation, and modification by user
    • Correct mistakes
    • Add regular expressions
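
The "Add regular expressions" step in the de-identification slide might look like the minimal Python sketch below. It is an illustration only: the patterns, the category labels, and the helper names `redact` and `add_pattern` are assumptions made for this example and are not taken from the iDASH de-identification service, which combines existing de-id modules in an ensemble.

```python
import re

# Hypothetical rule set for illustration only; NOT the iDASH rules.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "MRN": re.compile(r"\bMRN[:#]?\s*\d{6,10}\b", re.IGNORECASE),
}


def redact(text, patterns=PATTERNS):
    """Replace every match of each pattern with a bracketed category tag."""
    for label, pattern in patterns.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def add_pattern(patterns, label, regex):
    """Return a copy of the rule set extended with a user-supplied expression."""
    extended = dict(patterns)
    extended[label] = re.compile(regex)
    return extended


note = "Seen 03/14/2011, MRN: 00123456. Call 619-555-0100."
print(redact(note))  # Seen [DATE], [MRN]. Call [PHONE].

# A user correcting a miss on local text by adding one more expression.
local_rules = add_pattern(PATTERNS, "ZIP", r"\b\d{5}\b")
print(redact("La Jolla, CA 92093", local_rules))  # La Jolla, CA [ZIP]
```

Rules of this kind only catch identifiers with a regular surface form. Names, unique cases, and identifying descriptions need the trained modules and the user's iterative corrections.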
Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery

Increase access to textual features
TextVect

TextVect
Select level
• Sentence
• Document
Select features
• Lexical: N-gram
• Syntactic: Part-of-speech tags
• Semantic: UMLS codes
Select output
• Feature vector
• Train classifier
NLM: Abhishek Kumar

Decrease the Burden of Customizing an NLP Application
collaborative Knowledge Authoring Support Service (cKASS)

Customizing an IE App
Map IE Output to User’s Concepts
• User’s Concepts: Cough; Dyspnea; Infiltrate on CXR; Wheezing; Fever; Cervical Lymphadenopathy
• IE Output: Dry cough; Productive cough; Cough; Hacking cough; Bloody cough
• IE Output: Temp 38.0C; Low-grade temperature
• IE Output: NECK: no adenopathy (Disorder: adenopathy; Negation: negated)

KOS-IE
Knowledge Organization Systems for Information Extraction
Compile information helpful for IE
Concept: xRayPneumothorax
• preferred label: "x-ray pneumothorax"@en
• alternative label: "air in the pleural space on x-ray"@en
• alternative label: "xray pneumothorax"@en
• broader: respiratorySyndrome
• data category: "symptom"
• data category: "chest_radiography"
• isAssociatedWithDisease: pneumothoraxDX
• modified: 2011-03-31
• definition: "Air between the lung and the chest wall seen on chest roentgenogram"

Collaborative Knowledge Base Development: cKASS
Diagram labels: Radiologist; NLP Tools; Physician; Nurse; Clinical Researcher; Knowledge Engineer
Diagram labels (continued): Decision Support System; User KB; Shared KB; External KB
LQ Wang, M Conway, F Fana, M Tharp, D Hillert

Knowledge Authoring
Augment user KB with lexical variants, synonyms, and related concepts
• User-driven authoring
  – Top-down: Provide access to external knowledge sources
    • UMLS, Specialist Lexicon, Bioportal
  – Bottom-up: Annotate to derive synonyms
• Recommendation-based authoring
  – Generate lexical variants
  – Mine external knowledge sources
  – Mine patient records

Decrease the Burden of Evaluation & Error Analysis
Evaluation workbench

Evaluation Workbench
• Compare the output of two NLP annotators on clinical text
• NLP system vs human annotation
• View annotations
• Calculate outcome measures
• Drill down to all levels of annotation
• Document-level
• Perform error analysis
• Future versions will support formal error analysis

Levels of Annotation
• Document
  – Report classified as Shigellosis
• Group
  – Section classified as Past Medical History Section
• Utterance
  – Group of text classified as Sentence
• Snippet
  – “chest pain” classified as CUI 058273
• Word
  – “pain” classified as noun
• Token
  – “.” classified as EOS marker

Screenshot panels: Document & annotations; Outcome Measures for Selected Annotations; Report List; Attributes for Selected Annotation; Select Classifications to View; Relationships for Selected Annotation
VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova

Decrease the Burden of Annotation
Annotation Environment

Challenges to Annotating
• Time consuming
  – Recruiting & training annotators for high agreement
• Expensive
  – Domain experts especially expensive
  – Need for annotation by multiple people
• Challenging to design annotation task
  – How many annotators?
  – How should I quantify quality of annotations?
• Logistically challenging
  – Managing files and batches of reports
  – Setting up annotation tool
• Reinventing the wheel
  – Hasn’t someone created a schema for this before?
How can we reduce the burden of annotation?
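
One of the design questions above, how to quantify the quality of annotations, is often answered with a chance-corrected agreement statistic. The Python sketch below computes Cohen's kappa for two annotators who labeled the same items. Kappa is offered here as one common choice, not as the measure used by the iDASH tools, and the function name and the example labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items, in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four snippets (e.g. negation status).
annotator_1 = ["neg", "pos", "pos", "neg"]
annotator_2 = ["neg", "pos", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Raw percent agreement here is 0.75, but half of that is expected by chance given how often each annotator used each label, which is why kappa comes out lower.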
iDASH Annotation Environment
Goal: provide an environment to decrease the burden of annotation for research and application
• Annotator Registry
• Annotation Admin
  – Web application
  – iDASH cloud
• eHOST
  – Client app on your computer
VA, SHARP, and NIGMS: S Duvall, B South, G Savova, N Elhadad, H Hochheiser

Annotator Registry
• Enlist for annotation
• Certify for annotation tasks
  – Personal health information
  – Part-of-speech tagging
  – UMLS mapping
• Set pay rate
• Searchable
• Available for inclusion in new annotation task
http://idash.ucsd.edu/nlp-annotator-registry

Annotation Admin: Intended Users & Uses
Users
• NLP researchers
• Annotation administrators
Uses
• Manage annotation projects
  – who annotates what
  – Currently done with hundreds of files on hard drive
• Integrate with annotation tool (eHOST)
  – Download batches of raw reports to annotators
  – Upload and store annotated reports
• Manage simple annotation projects
• Facilitate distributed annotation

Annotation Admin
1. Assign annotators to a task
2. Create a Schema
3. Assign users and set time expectations
4. Keep track of progress

Collaborative Effort to Build Resources
Tools & Services
• Increase access to NLP
  – De-identification
  – TextVect
  – Virtual Machines
  – Registry
• Decrease Burden of Developing NLP
  – Evaluation Workbench
  – Collaborative Knowledge Authoring
  – Annotation Environment

Conclusion
• More demand for EHR data
  – NLP has potential to extend value of narrative clinical reports
• There have been many barriers
  – To development
  – To deployment
• Recent developments facilitate collaboration & sharing
  – Common annotation conventions
  – Privacy algorithms
  – Shared datasets
  – Hosted environments
• iDASH hopes to facilitate
  – Development of NLP
  – Application of NLP

Integrating Data for Analysis, Anonymization, and Sharing
Questions | Discussion
iDASH/ShARe Workshop on Annotation
September 29, 2012
La Jolla, CA
Division of Biomedical Informatics
University of California, San Diego
wwchapman@ucsd.edu