Natural Language Processing for Biosurveillance

Integrating Data for Analysis, Anonymization, and Sharing
An NLP Ecosystem for Development and Use of Natural Language Processing in the Clinical Domain
Wendy W. Chapman, PhD
Division of Biomedical Informatics
University of California, San Diego
Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• iDASH
• Opportunities for sharing and collaboration in NLP
NLP Success
“Fresh off its butt-kicking performance on Jeopardy!, IBM’s supercomputer "Watson" has enrolled in medical school at Columbia University,”
New York Daily News, February 18, 2011
Clinical NLP Since the 1960s
Why has clinical NLP had little impact on
clinical care?
Barriers to Development
• Sharing clinical data difficult
– Have not had shared datasets for development and
evaluation
– Modules trained on general English not sufficient
• Insufficient common conventions and standards for
annotations
– Data sets are unique to a lab
– Not easily interchangeable
• Limited collaboration
– Clinical NLP applications are silos and black boxes
– Have not had open source applications
• Reproducibility is formidable
– Open source release not always sufficient
– Software engineering quality not always great
– Mechanisms for reproducing results are sparse
Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• Developing an NLP ecosystem on iDASH
Security & Privacy Concerns
• Clinical texts have many patient identifiers
– 18 HIPAA identifiers
• Names
• Addresses
• Items not regulated by HIPAA
– Tight end for the Steelers
• Unique cases
– Woman in her 50s who is pregnant
• Sensitive information
– HIV status
Institutions are reluctant to share data
Lack of user-centered development and scalability
– Perceived cost of applying NLP outweighs the
perceived benefit (Len D’Avolio)
Overview
• The promise of natural language processing (NLP)
• Challenges of developing NLP in the clinical domain
• Challenges in applying NLP in the clinical domain
• Developing an NLP ecosystem on iDASH
iDASH
• Integrating Data for
– Analysis
– Anonymization
– Sharing
[Diagram labels: Data; Software/Tools; Computational Resources]
Disincentives to Share
• ‘Scooping’ by faster analysts
• Exposure of potential errors in data
• Resources for preparing data submissions
• Maintaining data
• Interacting with potential users takes time
• Threat of privacy breach when human subjects
are involved
– Do not have policies in place
– Fallible de-identification, anonymization algorithms
iDASH aims to minimize these disincentives
nlp-ecosystem.ucsd.edu
HIPAA &/or FISMA Compliant Cloud
• Access control
• De-identification
• Query counts
• Artificial data generators
[Diagram labels: Digital informed consent; Privacy preserving; Informed Consent Registry; Customizable DUAs]
NLP Ecosystem
[Diagram labels: Resources; Education; Schemas; Bibliography; Tutorials; Research Guidelines; Data; UCSD Clinical Data; MT Samples; Tools & Services; De-identification; TextVect; Virtual Machines; Registry; Evaluation Workbench; Collaborative Development Tools; Annotation Admin & eHOST]
2011 summer internship program funded by NIH U54HL108460
Collaborative Effort to Build Ecosystem
• Increase access to NLP
– De-identification Tools & Services
– TextVect
– Virtual Machines
– Registry
• Decrease burden of developing NLP
– Evaluation Workbench
– Collaborative Knowledge Authoring
– Annotation Environment
Increase ability to find NLP tools
Registry: orbit.nlm.nih.gov
Len D’Avolio, Dina Demner-Fushman
Increase access to clinical text
De-identification service
De-identification
• Several available de-identification modules
• Need to adapt to local text
– Efficient
– Secure
• Customizable ensemble de-identification system
– Build a de-identified corpus
– Incorporate existing de-id modules
– Launch as virtual machine
– Iterative training, evaluation, and modification by user
• Correct mistakes
• Add regular expressions
Brett South, Stephane Meystre, Oscar Fernandez, Danielle Mowery
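The idea of an ensemble that a user refines by adding regular expressions can be sketched as follows. This is a minimal illustration, not the iDASH de-identification system: the helper names (regex_module, ensemble_deidentify), the patterns, and the sample note are all assumptions made for the example, and real de-identification modules would be far more thorough.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, identifier label)
Module = Callable[[str], List[Span]]


def regex_module(pattern: str, label: str) -> Module:
    """Wrap one regular expression as a de-identification module (hypothetical helper)."""
    compiled = re.compile(pattern)

    def detect(text: str) -> List[Span]:
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]

    return detect


def ensemble_deidentify(text: str, modules: List[Module]) -> str:
    """Union the spans found by every module and replace each with a [LABEL] tag."""
    spans: List[Span] = []
    for module in modules:
        spans.extend(module(text))
    spans.sort(key=lambda s: (s[0], -s[1]))

    # Merge overlapping spans so that no character is replaced twice.
    merged: List[Span] = []
    for start, end, label in spans:
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))

    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in reversed(merged):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Illustrative patterns only; they are not a complete identifier set.
modules = [
    regex_module(r"\b\d{3}-\d{2}-\d{4}\b", "SSN"),
    regex_module(r"\b\d{1,2}/\d{1,2}/\d{4}\b", "DATE"),
]
note = "Seen on 03/14/2011. MRN 00123456. SSN 123-45-6789."
first_pass = ensemble_deidentify(note, modules)

# The user reviews the output, spots a missed local identifier, and adds a regex.
modules.append(regex_module(r"\bMRN \d{8}\b", "MRN"))
second_pass = ensemble_deidentify(note, modules)
```

In this sketch the first pass leaves the record number in place, and the second pass removes it once the locally written pattern is added, which mirrors the "correct mistakes, add regular expressions" loop.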
Increase access to textual features
TextVect
• Select level
– Sentence
– Document
• Select features
– Lexical: N-gram
– Syntactic: Part-of-speech tags
– Semantic: UMLS codes
• Select output
– Feature vector
– Train classifier
NLM: Abhishek Kumar
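The select-level, select-features, select-output flow can be illustrated with lexical n-gram features only. The sketch below is not TextVect itself: the function names, the naive sentence splitter, and the tokenizer are assumptions, and syntactic or semantic features (part-of-speech tags, UMLS codes) would need external tools that are not shown.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_units(text: str, level: str) -> List[str]:
    """Select level: one unit per document or one unit per sentence."""
    if level == "document":
        return [text]
    if level == "sentence":
        # Naive split on sentence-final punctuation followed by whitespace.
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown level: {level}")


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Select features: count word n-grams in one unit."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def vectorize(text: str, level: str = "sentence", n: int = 1) -> Tuple[List[str], List[List[int]]]:
    """Select output: a shared vocabulary plus one count vector per unit."""
    counts = [ngram_counts(tokenize(unit), n) for unit in split_units(text, level)]
    vocabulary = sorted(set().union(*counts))
    vectors = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, vectors


report = "No cough. Patient has fever and cough."
sentence_vocab, sentence_vectors = vectorize(report, level="sentence", n=1)
document_vocab, document_vectors = vectorize(report, level="document", n=2)
```

The resulting vectors could then be passed to any classifier; that step is left out because the slides do not say which learner TextVect uses.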
Decrease the Burden of Customizing
an NLP Application
Collaborative Knowledge Authoring Support Service (cKASS)
Customizing an IE App
• Map the IE output to the user's concepts
User’s Concepts
• Cough
• Dyspnea
• Infiltrate on CXR
• Wheezing
• Fever
• Cervical Lymphadenopathy
IE Output examples
• Dry cough; Productive cough; Cough; Hacking cough; Bloody cough
• Temp 38.0C; Low-grade temperature
• "NECK: no adenopathy" (Disorder: adenopathy; Negation: negated)
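The mapping problem in these slides can be sketched as a small lookup with a few rules. Everything named here is an assumption for illustration: the synonym table, the 38.0 C fever threshold, treating "low-grade temperature" as Fever, and the rule that adenopathy found in a NECK section maps to Cervical Lymphadenopathy. A real application would draw these from a curated knowledge base.

```python
import re
from typing import Optional, Tuple

# Hypothetical synonym table: IE output string -> user's concept.
SYNONYMS = {
    "cough": "Cough",
    "dry cough": "Cough",
    "productive cough": "Cough",
    "hacking cough": "Cough",
    "bloody cough": "Cough",
    "low-grade temperature": "Fever",  # assumption: the user counts this as fever
}

# Hypothetical section-specific table: (section, IE output string) -> user's concept.
SECTION_SYNONYMS = {
    ("neck", "adenopathy"): "Cervical Lymphadenopathy",
}

FEVER_THRESHOLD_C = 38.0  # assumption, chosen to match the "Temp 38.0C" example


def map_to_user_concept(
    mention: str, section: str = "", negated: bool = False
) -> Optional[Tuple[str, str]]:
    """Return (user concept, 'present' or 'absent'), or None if nothing maps."""
    text = mention.strip().lower()
    concept = SECTION_SYNONYMS.get((section.strip().lower(), text))
    if concept is None:
        concept = SYNONYMS.get(text)
    if concept is None:
        # Numeric rule: a Celsius temperature at or above the threshold maps to Fever.
        match = re.fullmatch(r"temp\s+(\d+(?:\.\d+)?)\s*c", text)
        if match and float(match.group(1)) >= FEVER_THRESHOLD_C:
            concept = "Fever"
    if concept is None:
        return None
    return (concept, "absent" if negated else "present")


cough = map_to_user_concept("Hacking cough")
fever = map_to_user_concept("Temp 38.0C")
nodes = map_to_user_concept("adenopathy", section="NECK", negated=True)
```

The three cases correspond to the three kinds of customization shown: synonym expansion, numeric interpretation, and combining section context with negation.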
KOS-IE
Knowledge Organization Systems for Information Extraction
Compile information helpful for IE
[Diagram: concept record for xRayPneumothorax]
• Labels (one preferred, two alternative): "x-ray pneumothorax"@en; "xray pneumothorax"@en; "air in the pleural space on x-ray"@en
• broader: respiratorySyndrome
• data category: "symptom"; "chest_radiography"
• isAssociatedWithDisease: pneumothoraxDX
• modified: 2011-03-31
• definition: "Air between the lung and the chest wall seen on chest roentgenogram"
Collaborative Knowledge Base Development: cKASS
[Diagram labels: users (Physician, Radiologist, Nurse, Clinical Researcher, Knowledge Engineer); NLP Tools; Decision Support System; User KB; Shared KB; External KB]
LQ Wang, M Conway, F Fana, M Tharp, D Hillert
Knowledge Authoring
Augment user KB with lexical variants, synonyms,
and related concepts
• User-driven authoring
– Top-down: Provide access to external knowledge sources
• UMLS, SPECIALIST Lexicon, BioPortal
– Bottom-up: Annotate to derive synonyms
• Recommendation-based authoring
– Generate lexical variants
– Mine external knowledge sources
– Mine patient records
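Recommendation-based authoring by generating lexical variants can be sketched in a few lines. This is an assumption-laden illustration, not the cKASS implementation: it handles only case and hyphenation variants (the kind that separates "x-ray pneumothorax" from "xray pneumothorax"), and the function names and the sample knowledge base are hypothetical.

```python
from typing import Dict, Set


def lexical_variants(term: str) -> Set[str]:
    """Generate simple spelling variants of a term: lowercase and hyphen handling only."""
    base = " ".join(term.lower().split())  # normalize case and whitespace
    variants = {base}
    if "-" in base:
        variants.add(base.replace("-", " "))  # "x-ray" -> "x ray"
        variants.add(base.replace("-", ""))   # "x-ray" -> "xray"
    return variants


def augment_kb(user_kb: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return a copy of the user KB with generated variants added to every concept."""
    return {
        concept: set().union(*(lexical_variants(term) for term in terms))
        for concept, terms in user_kb.items()
    }


# Hypothetical user KB with a single concept and a single term.
user_kb = {"xRayPneumothorax": {"X-ray pneumothorax"}}
augmented = augment_kb(user_kb)
```

In practice the generated variants would be offered to the user as recommendations to accept or reject, and synonyms mined from external sources or patient records would be added through the same interface.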
Decrease the Burden of Evaluation &
Error Analysis
Evaluation workbench
Evaluation Workbench
• Compare the output of two NLP annotators on
clinical text
• NLP system vs human annotation
• View annotations
• Calculate outcome measures
• Drill down to all levels of annotation
• Document-level
• Perform error analysis
• Future versions will support formal error analysis
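Calculating outcome measures for two annotators can be sketched as a set comparison. The sketch assumes exact matching on document, span offsets, and classification, with the human annotation treated as the reference; the Evaluation Workbench may use different matching criteria, and the annotation tuples below are invented for illustration.

```python
from typing import Dict, Set, Tuple

# (document id, start offset, end offset, classification)
Annotation = Tuple[str, int, int, str]


def outcome_measures(reference: Set[Annotation], system: Set[Annotation]) -> Dict[str, float]:
    """Compare system annotations against reference annotations by exact match."""
    tp = len(reference & system)   # found by both
    fp = len(system - reference)   # found only by the system
    fn = len(reference - system)   # missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


human = {
    ("report_1", 0, 10, "chest pain"),
    ("report_1", 20, 25, "fever"),
    ("report_2", 5, 10, "cough"),
}
nlp = {
    ("report_1", 0, 10, "chest pain"),
    ("report_2", 5, 10, "cough"),
    ("report_2", 30, 38, "dyspnea"),
}
measures = outcome_measures(human, nlp)

# Disagreements are the starting point for error analysis.
false_positives = sorted(nlp - human)
false_negatives = sorted(human - nlp)
```

Listing the false positives and false negatives by document gives a simple form of drill-down from the summary measures to the individual annotations behind them.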
Levels of Annotation
• Document
– Report classified as Shigellosis
• Group
– Section classified as Past Medical History Section
• Utterance
– Group of text classified as Sentence
• Snippet
– “chest pain” classified as CUI 058273
• Word
– “pain” classified as noun
• Token
– “.” classified as EOS marker
[Screenshot of the Evaluation Workbench. Panels: Document & annotations; Outcome Measures for Selected Annotations; Report List; Attributes for Selected Annotation; Select Classifications to View; Relationships for Selected Annotation]
VA and ONC SHARP: Christensen, Murphy, Frabetti, Rodriguez, Savova
Decrease the Burden of Annotation
Annotation Environment
Challenges to Annotating
• Time consuming
– Recruiting & training annotators for high agreement
• Expensive
– Domain experts especially expensive
– Need for annotation by multiple people
• Challenging to design annotation task
– How many annotators?
– How should I quantify quality of annotations?
• Logistically challenging
– Managing files and batches of reports
– Setting up annotation tool
• Reinventing the wheel
– Hasn’t someone created a schema for this before?
How can we reduce the burden of
annotation?
iDASH Annotation Environment
Goal: provide an environment to decrease the burden of annotation for research and application
• Annotator Registry
• Annotation Admin: web application on the iDASH cloud
• eHOST: client app on your computer
VA, SHARP, and NIGMS: S Duvall, B South, G Savova, N Elhadad, H Hochheiser
Annotator Registry
• Enlist for annotation
• Certify for annotation tasks
– Personal health information
– Part-of-speech tagging
– UMLS mapping
• Set pay rate
• Searchable
• Available for inclusion in
new annotation task
http://idash.ucsd.edu/nlp-annotator-registry
Annotation Admin:
Intended Users & Uses
Users
• NLP researchers
• Annotation administrators
Uses
• Manage annotation projects – who annotates what
– Currently done with hundreds of files on hard drive
• Integrate with annotation tool (eHOST)
– Download batches of raw reports to annotators
– Upload and store annotated reports
• Manage simple annotation projects
• Facilitate distributed annotation
Annotation Admin
1. Assign annotators to a task
2. Create a Schema
3. Assign users and set time expectations
4. Keep track of progress
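Managing "who annotates what" can be sketched as batching reports, assigning each batch to more than one annotator, and tracking completion. This is only an illustration of the bookkeeping involved: the function names, the round-robin assignment policy, and the annotator names are assumptions, not a description of how Annotation Admin works.

```python
from itertools import cycle
from typing import Dict, List, Set, Tuple


def make_batches(report_ids: List[str], batch_size: int) -> List[List[str]]:
    """Split the report list into consecutive batches."""
    return [report_ids[i:i + batch_size] for i in range(0, len(report_ids), batch_size)]


def assign_batches(
    batch_count: int, annotators: List[str], annotators_per_batch: int = 2
) -> Dict[str, List[int]]:
    """Assign every batch to several distinct annotators in round-robin order."""
    if annotators_per_batch > len(annotators):
        raise ValueError("not enough annotators for the requested overlap")
    assignments: Dict[str, List[int]] = {name: [] for name in annotators}
    rotation = cycle(annotators)
    for batch_index in range(batch_count):
        for _ in range(annotators_per_batch):
            assignments[next(rotation)].append(batch_index)
    return assignments


def progress(
    assignments: Dict[str, List[int]], completed: Set[Tuple[str, int]]
) -> Dict[str, float]:
    """Fraction of assigned batches each annotator has uploaded."""
    return {
        name: (sum((name, b) in completed for b in batches) / len(batches)) if batches else 1.0
        for name, batches in assignments.items()
    }


reports = [f"report_{i}" for i in range(1, 7)]
batches = make_batches(reports, batch_size=2)
assignments = assign_batches(len(batches), ["ann_a", "ann_b", "ann_c"])
status = progress(assignments, completed={("ann_a", 0), ("ann_b", 0), ("ann_b", 2)})
```

Assigning each batch to two annotators reflects the need for annotation by multiple people, so that agreement can be measured on the shared reports.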
Collaborative Effort to Build Resources
• Increase access to NLP
– De-identification Tools & Services
– TextVect
– Virtual Machines
– Registry
• Decrease burden of developing NLP
– Evaluation Workbench
– Collaborative Knowledge Authoring
– Annotation Environment
Conclusion
• More demand for EHR data
– NLP has potential to extend value of narrative clinical reports
• There have been many barriers
– To development
– To deployment
• Recent developments facilitate collaboration & sharing
– Common annotation conventions
– Privacy algorithms
– Shared datasets
– Hosted environments
• iDASH hopes to facilitate
– Development of NLP
– Application of NLP
Questions | Discussion
iDASH/ShARe Workshop on Annotation
September 29, 2012
La Jolla, CA
Division of Biomedical Informatics
University of California, San Diego
wwchapman@ucsd.edu