Rebecca Crowley crowleyrs@upmc.edu
Overview of the Project
Aims, People, Organization, Domain, Philosophy
Specific Aims from a use case approach
Information Extraction
Ontology Enrichment
First steps, synergies, year 1 work, and working together
Slide # 2/35
Funded by National Cancer Center
Develop tools for
Information extraction from clinical text using ontologies
Enrichment of ontologies using clinical text
Project Period: 9/27/2007 – 7/31/2011
Collaboration with National Center for Biomedical Ontology
Subcontract to Stanford (consultation on Bioportal)
Subcontract to Mayo (Terminologies, NLP)
Slide # 3/35
Year 1 development goals
Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
1. Named Entity Recognition
2. Co-reference Resolution
3. Discourse Reasoning
4. Attribute Value Extraction
Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
1. Preprocessing
2. Concept Discovery and Clustering
3. Suggest taxonomic positioning and relationships
Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.
Slide # 4/35
Slide # 5/35
Wendy Chapman, co-I
Rebecca Crowley, PI
Preet Chaudhary, co-I
Kaihong Liu, Graduate Student
Kevin Mitchell, Architect
Girish Chavan, Interfaces
John Dowling, Annotation
Slide # 6/35
Annotations
Develop manually annotated sets for training and testing
Rebecca Crowley
Wendy Chapman
Kaihong Liu
John Dowling
Algorithms
Consider and test existing algorithms; design, implement and test new algorithms
Rebecca Crowley
Wendy Chapman
Kaihong Liu
Kevin Mitchell
Architecture
Develop and implement architecture
Rebecca Crowley
Kevin Mitchell
Girish Chavan
Slide # 7/35
Will attempt to develop general tools whenever possible
Priorities for evaluation of components:
Radiology and pathology reports
NCIT as well as other clinically relevant OBO ontologies
Cancer domains (including hematologic oncology)
Slide # 8/35
Toolkit for developers of NLP applications and ontologies
Support interaction and experimentation
Package systems at the conclusion of working with ODIE
Foster cycle of enrichment and extraction needed to advance development of NLP systems
Ontology enrichment as opposed to de novo development
Human-machine collaboration as opposed to fully automated learning
Slide # 9/35
Key ODIE Functionality
Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:
1. Named Entity Recognition
2. Co-reference Resolution
3. Discourse Reasoning
4. Attribute Value Extraction
Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:
1. Preprocessing
2. Concept Discovery and Clustering
3. Suggest taxonomic positioning and relationships
Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.
Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.
Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.
Slide # 10/35
User has
clinical documents
one or more ontologies
(and/or) one or more lexical resources (synonyms, POS)
(optionally) a reference standard of human annotations
User wants to
determine how well each ontology covers the text
determine the degree of overlap between annotations generated from different ontologies
(optionally) test accuracy of NER with different ontologies to choose ‘best’ ontology to annotate text with
tag existing document set with concepts from ontology (optionally using the synonyms from their synonym source if not in ontology)
System produces annotated clinical documents and descriptive statistics
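The use case above can be sketched as a dictionary lookup followed by two descriptive statistics. This is a minimal illustration, not the ODIE implementation: the two toy ontologies, their synonym lists, the sample sentence, and the function names `annotate`, `coverage`, and `overlap` are all assumptions made for the example.

```python
import re

# Hypothetical toy ontologies: concept name -> list of synonyms (assumed data).
ONTOLOGY_A = {
    "Prostatic Adenocarcinoma": ["prostatic adenocarcinoma", "adenocarcinoma of the prostate"],
    "Prostate Neoplasm": ["prostate neoplasm", "prostate tumor"],
}
ONTOLOGY_B = {
    "Adenocarcinoma": ["adenocarcinoma"],
    "Prostate": ["prostate", "prostatic"],
}


def annotate(text, ontology):
    """Return sorted (start, end, concept) spans for every synonym found in text."""
    spans = []
    for concept, synonyms in ontology.items():
        for synonym in synonyms:
            pattern = r"\b" + re.escape(synonym) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), concept))
    return sorted(spans)


def coverage(text, spans):
    """Fraction of word tokens that fall inside at least one annotated span."""
    tokens = list(re.finditer(r"\w+", text))
    if not tokens:
        return 0.0
    covered = sum(
        1 for tok in tokens
        if any(start <= tok.start() and tok.end() <= end for start, end, _ in spans)
    )
    return covered / len(tokens)


def overlap(spans_a, spans_b):
    """Jaccard overlap of the character positions annotated by two ontologies."""
    chars_a = {i for s, e, _ in spans_a for i in range(s, e)}
    chars_b = {i for s, e, _ in spans_b for i in range(s, e)}
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 0.0


# Hypothetical sentence standing in for a clinical document.
document = "Biopsy shows prostatic adenocarcinoma. No other prostate tumor seen."
spans_a = annotate(document, ONTOLOGY_A)
spans_b = annotate(document, ONTOLOGY_B)
print(spans_a)
print(coverage(document, spans_a), coverage(document, spans_b), overlap(spans_a, spans_b))
```

Token-level coverage and character-level Jaccard overlap are only one possible choice of statistics; a reference standard of human annotations would additionally allow precision and recall per ontology.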
Slide # 11/35
Clinical Document
Ontology
Neoplasm
Reproductive System Neoplasm
Prostate Neoplasm
Malignant Prostate Neoplasm
Invasive Prostate Carcinoma
Prostatic adenocarcinoma
Lexical Resource
Metathesaurus (synonyms)
SPECIALIST (POS information)
Slide # 12/35
View Annotated Concepts From A Single Ontology
Slide # 13/35
Compare Annotations from Multiple Ontologies
Slide # 14/35
User has
clinical documents with NER annotations
one or more ontologies
(optionally) a reference standard of co-reference annotations
User wants to
visualize co-references detected using one or more ontologies
(optionally) test accuracy of CR with different ontologies to choose ontology for annotations
tag existing document set with co-references from ontology
System produces annotated clinical documents and descriptive statistics
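One way an ontology can support co-reference detection is subsumption: two mentions become candidates for the same referent when one concept is an ancestor of the other. The sketch below illustrates that idea only; it is not the ODIE algorithm. The parent links reuse the prostate hierarchy from these slides (the exact parent/child order is assumed), and the mentions, the "Bladder" concept, and the function names are hypothetical.

```python
# Parent links based on the prostate hierarchy shown on the slides;
# the exact parent/child ordering is an assumption for illustration.
PARENT = {
    "Prostatic adenocarcinoma": "Invasive Prostate Carcinoma",
    "Invasive Prostate Carcinoma": "Malignant Prostate Neoplasm",
    "Malignant Prostate Neoplasm": "Prostate Neoplasm",
    "Prostate Neoplasm": "Reproductive System Neoplasm",
    "Reproductive System Neoplasm": "Neoplasm",
}


def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def may_corefer(concept_a, concept_b):
    """Two mentions are co-reference candidates if one concept subsumes the other."""
    return concept_a in ancestors(concept_b) or concept_b in ancestors(concept_a)


def link_mentions(mentions):
    """Link each mention to the closest earlier mention with a compatible concept."""
    links = []
    for i, (text_i, concept_i) in enumerate(mentions):
        for j in range(i - 1, -1, -1):
            text_j, concept_j = mentions[j]
            if may_corefer(concept_i, concept_j):
                links.append((text_i, text_j))
                break
    return links


# Hypothetical NER output: (mention text, concept) in document order.
mentions = [
    ("prostatic adenocarcinoma", "Prostatic adenocarcinoma"),
    ("the bladder", "Bladder"),
    ("the malignant neoplasm", "Malignant Prostate Neoplasm"),
]
print(link_mentions(mentions))
# [('the malignant neoplasm', 'prostatic adenocarcinoma')]
```

Subsumption alone over-generates (two different tumors can share an ancestor concept), so a real system would combine it with distance, section, and syntactic cues and check the result against a reference standard.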
Slide # 15/35
Neoplasm
Reproductive System Neoplasm
Prostate Neoplasm
Malignant Prostate Neoplasm
Invasive Prostate Carcinoma
Prostatic adenocarcinoma
Slide # 16/35
User has
a set of clinical documents with NER and CR annotations
a set of information models about those documents
User wants to
determine which information model (or parts of them) should be used for which clinical document
Slide # 17/35
BRAIN, RIGHT PARIETAL, STEREOTACTIC BIOPSY:
Mucinous Adenocarcinoma, consistent with previous history of colon primary
BRAIN
Site
Morphology
COLON
Location
Grade
Size
TNM Stage
Slide # 18/35
User has
clinical documents with NER, CR, DR annotations
information model of specific subset of documents
User wants to
extract attributes and values from clinical text conforming to the model
analyze data using common tools
possibly search later for particular cases
Slide # 19/35
Histologic Type
Clark’s Level
Breslow Depth
Mitoses
Ulcer
Perineural Invasion
Angiolymphatic Invasion
Regression
Slide # 20/35
Histologic Type – Superficial Spreading
Clark’s Level – IV
Breslow Depth – 1.75 mm
Mitoses – Greater than 2 per HPF
Ulcer – None
Perineural Invasion – None
Angiolymphatic Invasion – None
Regression – None
Slide # 21/35
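Filling an information model from text can be sketched as one pattern per attribute. The example below is a simplified illustration under stated assumptions: the attribute names come from the melanoma template on the slides, but the regular expressions, the report text, and the `extract` function are hypothetical and are not the extraction rules used by ODIE.

```python
import re

# Attributes of the melanoma template from the slides, each paired with a
# hypothetical value pattern (the patterns are assumptions, not ODIE rules).
TEMPLATE = {
    "Histologic Type": r"(superficial spreading|nodular|lentigo maligna)",
    "Clark's Level": r"Clark'?s level\s+(I{1,3}|IV|V)\b",
    "Breslow Depth": r"Breslow (?:depth|thickness)\s+(?:of\s+)?(\d+(?:\.\d+)?\s*mm)",
    "Ulcer": r"ulceration\s+(?:is\s+)?(present|absent)",
}


def extract(report, template):
    """Return {attribute: value} for each template attribute; None when not found."""
    record = {}
    for attribute, pattern in template.items():
        match = re.search(pattern, report, flags=re.IGNORECASE)
        # Attributes with no match stay None so gaps are visible downstream.
        record[attribute] = match.group(1) if match else None
    return record


# Hypothetical report text, written to mirror the filled template on the slide.
report = (
    "Malignant melanoma, superficial spreading type. "
    "Clark's level IV, Breslow depth 1.75 mm. Ulceration absent."
)
print(extract(report, TEMPLATE))
```

Because the result is a flat attribute-to-value record, it can be written to a table or spreadsheet and analyzed with common tools, which is the point of conforming to a model.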
User has
clinical documents
Ontology
User wants to identify potential candidate concepts from the documents to include in the ontology
Visualized in a manner that eases search and recognition of the presence or absence of those concepts in the ontology
Suggestions for where in taxonomy the concept should be placed
Suggestions for relationships
Slide # 22/35
Breast, Left, Excisional Biopsy:
Mucinous Carcinoma
Breast, Right, Lumpectomy:
Infiltrating Ductal Carcinoma
Breast, Left:
Invasive Ductal Carcinoma
Breast, Left, Excisional Biopsy:
Malignant Phylloides Tumor
Tumor shows osseous and lipomatous metaplasia
Disease or Disorder
Breast Disorder
Breast Neoplasm
Malignant Breast Neoplasm
Breast Carcinoma
Ductal Breast Carcinoma
Invasive Ductal Carcinoma
Slide # 23/35
Slide # 24/35
Breast, Left, Excisional Biopsy:
Mucinous Carcinoma
Breast, Right, Lumpectomy:
Infiltrating Ductal Carcinoma
Breast, Left:
Invasive Ductal Carcinoma
Breast, Left, Excisional Biopsy:
Malignant Phylloides Tumor
Tumor shows osseous and lipomatous metaplasia
Disease or Disorder
Breast Disorder
Breast Neoplasm
Malignant Breast Neoplasm
Breast Carcinoma
Ductal Breast Carcinoma
Invasive Ductal Carcinoma
Mucinous Carcinoma
Malignant Phylloides Tumor
Slide # 25/35
Breast, Left, Excisional Biopsy:
Mucinous Carcinoma
Breast, Right, Lumpectomy:
Infiltrating Ductal Carcinoma
Breast, Left:
Invasive Ductal Carcinoma
Breast, Left, Excisional Biopsy:
Malignant Phylloides Tumor
Tumor shows osseous and lipomatous metaplasia
Disease or Disorder
Breast Disorder
Breast Neoplasm
Malignant Breast Neoplasm
Breast Carcinoma
Ductal Breast Carcinoma
Invasive Ductal Carcinoma
Mucinous Carcinoma
Malignant Phylloides Tumor
has-Finding
Morphologic Finding
Metaplasia
Osseous metaplasia
Lipomatous metaplasia
Cartilaginous metaplasia
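The breast example can be walked through in code: find diagnosis strings that map to no existing concept, then suggest where each might sit in the taxonomy. This is a rough sketch, assuming the ontology fragment on the slides nests in the order listed; the synonym table, the word-overlap heuristic, and the function names are illustrative assumptions rather than the project's method, and any suggestion would still go to a human curator.

```python
# Ontology fragment from the slides as child -> parent links
# (the nesting order is assumed from the order shown).
ONTOLOGY = {
    "Breast Disorder": "Disease or Disorder",
    "Breast Neoplasm": "Breast Disorder",
    "Malignant Breast Neoplasm": "Breast Neoplasm",
    "Breast Carcinoma": "Malignant Breast Neoplasm",
    "Ductal Breast Carcinoma": "Breast Carcinoma",
    "Invasive Ductal Carcinoma": "Ductal Breast Carcinoma",
}
# Synonym table (assumed to come from a lexical resource).
SYNONYMS = {"Infiltrating Ductal Carcinoma": "Invasive Ductal Carcinoma"}

# Diagnosis lines from the example reports on the slides.
DIAGNOSES = [
    "Mucinous Carcinoma",
    "Infiltrating Ductal Carcinoma",
    "Invasive Ductal Carcinoma",
    "Malignant Phylloides Tumor",
]


def known_concepts(ontology):
    """All concept names that appear as a child or a parent."""
    return set(ontology) | set(ontology.values())


def candidates(diagnoses, ontology, synonyms):
    """Diagnoses that map to no ontology concept, directly or via a synonym."""
    known = known_concepts(ontology)
    return [d for d in diagnoses if synonyms.get(d, d) not in known]


def suggest_parent(candidate, ontology):
    """Suggest the concept sharing the most words with the candidate.

    Ties go to the concept with fewer words, which tends to be more general.
    """
    words = set(candidate.lower().split())
    best, best_key = None, (0, 0)
    for concept in sorted(known_concepts(ontology)):
        concept_words = set(concept.lower().split())
        shared = len(words & concept_words)
        key = (shared, -len(concept_words))
        if shared and key > best_key:
            best, best_key = concept, key
    return best


for candidate in candidates(DIAGNOSES, ONTOLOGY, SYNONYMS):
    print(candidate, "->", suggest_parent(candidate, ONTOLOGY))
# Mucinous Carcinoma -> Breast Carcinoma
# Malignant Phylloides Tumor -> Malignant Breast Neoplasm
```

Word overlap is deliberately crude; it stands in for the clustering and positioning methods named in Specific Aim 2 and only shows the shape of the human-machine loop: the system proposes, the curator accepts or corrects.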
Slide # 26/35
Use cases
Survey of Bioportal, LexBio, GATE and UIMA
Survey of ontology enrichment techniques
Architectural assumptions and notional architecture
Started discussions with Stanford and Mayo
Delineated first year work
Annotation software and document sets
Slide # 27/35
The primary goal of ODIE is to serve as a workbench for building and refining text processing pipelines and ontologies.
Information retrieval is not a primary goal. However, ODIE may have a rudimentary search feature for annotated document collections.
ODIE Toolkit will be a desktop application.
ODIE UI will be based on the Eclipse Rich Client Platform.
ODIE will use UIMA as the Language Engineering Platform. GATE processing resources will be usable in ODIE by wrapping them in UIMA TAEs.
UIMA is highly configurable using XML descriptor files.
Better documentation, community support.
We will use GATE in first year for rapid prototyping and manual annotation
ODIE will have the ability to easily import and use UIMA TAEs developed by others.
This may be expanded to GATE processing resources.
ODIE will allow for packaging a pipeline for deployment in a production environment.
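The pipeline idea behind these assumptions can be shown with a toy model: independent annotators that each add annotations to a shared document, chained in a configurable order. The sketch is written in plain Python for brevity and does not use or imitate the UIMA or GATE APIs; the `Document` class, the annotators, and `run_pipeline` are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Document:
    """Text plus (start, end, label) annotations accumulated by the pipeline."""
    text: str
    annotations: List[Tuple[int, int, str]] = field(default_factory=list)


# An annotator is any callable that adds annotations to a Document in place.
Annotator = Callable[[Document], None]


def sentence_annotator(doc: Document) -> None:
    """Mark each run of text ending in '.', '!' or '?' as a Sentence."""
    for match in re.finditer(r"[^.!?]+[.!?]", doc.text):
        doc.annotations.append((match.start(), match.end(), "Sentence"))


def make_term_annotator(term: str, label: str) -> Annotator:
    """Build an annotator that labels every case-insensitive occurrence of term."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)

    def annotate(doc: Document) -> None:
        for match in pattern.finditer(doc.text):
            doc.annotations.append((match.start(), match.end(), label))

    return annotate


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Apply annotators in order; later ones can read earlier annotations."""
    for annotator in annotators:
        annotator(doc)
    return doc


# A "packaged" pipeline here is just an ordered list of annotators.
pipeline = [sentence_annotator, make_term_annotator("carcinoma", "Concept")]
doc = run_pipeline(Document("Mucinous carcinoma. No invasion seen."), pipeline)
print(doc.annotations)
```

Keeping every component behind the same small interface is what makes it plausible to import components written by others or to wrap a component from another framework.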
Slide # 28/35
Slide # 29/35
• Information Retrieval
• Range of inputs
Ontrez
ODIE
• Annotation
• Named Entity Recognition
• Enhance annotation of Ontrez?
• Use inference and indexing on clinical documents?
• Other kinds of annotation
• Information Extraction
• Ontology Enrichment
• Clinical Documents
Slide # 30/35
NER and Co-reference resolution
Clustering, discovery of synonyms
LexGrid
Using similar tools, focused on larger range of document types
More – to be explored
Slide # 31/35
NER and co-reference modules
Concept discovery
Develop manually annotated reference standards for NER and CR
Focus on testing and developing algorithms
ODIE 1.0 will include basic architecture and modules for NER, CR and concept discovery, statistics
Slide # 32/35
Work with Mayo to scope first year collaboration (NER, CR, synonym discovery)
Decisions regarding terminology access
Better define what NCBO resources we will use
Slide # 33/35
SourceForge site, ODIE website and Wiki
All our meetings are open and we are happy to arrange teleconferences
Mondays 2-4 pm (EST)
Schedule visits with Mayo and Stanford for early spring ’08
Anticipate providing monthly progress updates at the ODIE website starting in January ’08
Other ideas? What’s the expectation of the Council?
Slide # 34/35