2007-12-18_Rebecca_Crowley_talk_slides

ODIE Toolkit

NCBO Council Talk

December 18, 2007

Rebecca Crowley crowleyrs@upmc.edu

Slide # 1/35

Outline

• Overview of the Project

• Aims, People, Organization, Domain, Philosophy

• Specific Aims from a use case approach

• Information Extraction

• Ontology Enrichment

• First steps, synergies, year 1 work, and working together

Slide # 2/35

Project Overview

• Funded by the National Cancer Institute

Develop tools for

• Information extraction from clinical text using ontologies

• Enrichment of ontologies using clinical text

Project Period: 9/27/2007 – 7/31/2011

Collaboration with the National Center for Biomedical Ontology

• Subcontract to Stanford (consultation on BioPortal)

• Subcontract to Mayo (Terminologies, NLP)

Slide # 3/35

Specific Aims

Year 1 development goals

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:

1. Named Entity Recognition

2. Co-reference Resolution

3. Discourse Reasoning

4. Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:

1. Preprocessing

2. Concept Discovery and Clustering

3. Suggest taxonomic positioning and relationships

4.

Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.

Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.

Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

Slide # 4/35

Dual Proposal Goals

Slide # 5/35

People @pitt

Wendy Chapman, co-I

Rebecca Crowley, PI

Preet Chaudhary, co-I

Kaihong Liu, Graduate Student

Kevin Mitchell, Architect

Girish Chavan, Interfaces

John Dowling, Annotation

Slide # 6/35

Organization

Annotations

Develop manually annotated sets for training and testing

Rebecca Crowley

Wendy Chapman

Kaihong Liu

John Dowling

Algorithms

Consider and test existing algorithms; design, implement and test new algorithms

Rebecca Crowley

Wendy Chapman

Kaihong Liu

Kevin Mitchell

Architecture

Develop and implement architecture

Rebecca Crowley

Kevin Mitchell

Girish Chavan

Slide # 7/35

Domain

• Will attempt to develop general tools whenever possible

• Priorities for evaluating components:

• Radiology and pathology reports

• NCIT as well as other clinically relevant OBO ontologies

• Cancer domains (including hematologic oncology)

Slide # 8/35

Philosophy

Toolkit for developers of NLP applications and ontologies

Support interaction and experimentation

Package systems at the conclusion of working with ODIE

Foster cycle of enrichment and extraction needed to advance development of NLP systems

Ontology enrichment as opposed to de novo development

Human-machine collaboration as opposed to fully automated learning

Slide # 9/35

Specific Aims

Key ODIE Functionality

Specific Aim 1: Develop and evaluate methods for information extraction (IE) tasks using existing OBO ontologies, including:

1. Named Entity Recognition

2. Co-reference Resolution

3. Discourse Reasoning

4. Attribute Value Extraction

Specific Aim 2: Develop and evaluate general methods for clinical-text mining to assist in ontology development, including:

1. Preprocessing

2. Concept Discovery and Clustering

3. Suggest taxonomic positioning and relationships

4.

Specific Aim 3: Develop reusable software for performing information extraction and ontology development leveraging existing NCBO tools and compatible with NCBO architecture.

Specific Aim 4: Enhance National Cancer Institute Thesaurus Ontology using the ODIE toolkit.

Specific Aim 5: Test the ability of the resulting software and ontologies to address important translational research questions in hematologic cancers.

Slide # 10/35

Named Entity Recognition

User has

• clinical documents

• one or more ontologies

• (and/or) one or more lexical resources (synonyms, POS)

• (optionally) a reference standard of human annotations

User wants to

• determine how well different ontologies cover the text

• determine the degree of overlap between annotations generated from different ontologies

• (optionally) test the accuracy of NER with different ontologies to choose the ‘best’ ontology for annotating the text

• tag an existing document set with concepts from an ontology (optionally using synonyms from their own synonym source if these are not in the ontology)

System produces annotated clinical documents and descriptive statistics
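The use case above can be illustrated with a minimal sketch of dictionary-based NER plus a simple coverage statistic. This is not the ODIE implementation: the toy ontology, its synonyms, and the function names (`build_lexicon`, `annotate`, `coverage`) are all hypothetical, and a real system would draw labels and synonyms from an ontology and a lexical resource.

```python
import re
from collections import Counter

# Hypothetical toy ontology: concept label -> synonyms (illustrative only).
ONTOLOGY = {
    "Prostate Neoplasm": ["prostate tumor"],
    "Prostatic adenocarcinoma": ["adenocarcinoma of the prostate"],
}

def build_lexicon(ontology):
    """Map every lower-cased surface form (label or synonym) to its concept."""
    lexicon = {}
    for concept, synonyms in ontology.items():
        for form in [concept, *synonyms]:
            lexicon[form.lower()] = concept
    return lexicon

def annotate(text, lexicon):
    """Return (start, end, concept) for each whole-word, case-insensitive match."""
    annotations = []
    for form, concept in lexicon.items():
        pattern = r"\b" + re.escape(form) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), concept))
    return sorted(annotations)

def coverage(documents, lexicon):
    """Fraction of documents with at least one annotation, plus concept counts."""
    counts = Counter()
    covered = 0
    for doc in documents:
        found = annotate(doc, lexicon)
        covered += bool(found)
        counts.update(concept for _, _, concept in found)
    return covered / len(documents), counts

docs = [
    "Biopsy shows adenocarcinoma of the prostate.",
    "No evidence of malignancy.",
]
lexicon = build_lexicon(ONTOLOGY)
ratio, counts = coverage(docs, lexicon)
print(ratio, dict(counts))
```

Running the same documents against lexicons built from different ontologies gives one rough way to compare their coverage of a document set.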

Slide # 11/35

Named Entity Recognition

Clinical Document

Neoplasm

Reproductive System Neoplasm

Prostate Neoplasm

Malignant Prostate Neoplasm

Malignant Prostate Neoplasm

Invasive Prostate Carcinoma

Prostatic adenocarcinoma

Ontology

Lexical Resource

Metathesaurus (synonyms)

SPECIALIST (POS information)

Slide # 12/35

Named Entity Recognition

View Annotated Concepts From A Single Ontology

Slide # 13/35

Named Entity Recognition

Compare Annotations from Multiple Ontologies
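One simple way to quantify how much the annotations from two ontologies overlap is the Jaccard index over annotated character spans. The sketch below is only an assumption about how such a comparison might be computed; the function name `span_overlap` and the concept identifiers are hypothetical.

```python
def span_overlap(annotations_a, annotations_b):
    """Jaccard overlap of the character spans annotated by two ontologies."""
    spans_a = {(start, end) for start, end, _ in annotations_a}
    spans_b = {(start, end) for start, end, _ in annotations_b}
    union = spans_a | spans_b
    if not union:
        return 0.0
    return len(spans_a & spans_b) / len(union)

# Hypothetical annotations of one document: (start, end, concept id).
from_ontology_a = [(13, 43, "A:0001"), (50, 58, "A:0002")]
from_ontology_b = [(13, 43, "B:9001"), (60, 70, "B:9002")]

# One shared span out of three distinct spans.
overlap = span_overlap(from_ontology_a, from_ontology_b)
print(round(overlap, 3))
```

Exact span matching is the strictest choice; a more lenient variant could count partially overlapping spans as matches.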

Slide # 14/35

Co-reference Resolution

User has

• clinical documents with NER annotations

• one or more ontologies

• (optionally) a reference standard of co-reference annotations

User wants to

• visualize co-references detected using one or more ontologies

• (optionally) test the accuracy of CR with different ontologies to choose an ontology for annotations

• tag an existing document set with co-references from an ontology

System produces annotated clinical documents and descriptive statistics

Slide # 15/35

Co-reference Resolution

Neoplasm

Reproductive System Neoplasm

Prostate Neoplasm

Malignant Prostate Neoplasm

Malignant Prostate Neoplasm

Invasive Prostate Carcinoma

Prostatic adenocarcinoma
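The hierarchy above suggests one way an ontology could support co-reference resolution: two mentions are candidates for co-reference when one concept is an ancestor of the other. The sketch below is a hedged illustration, not the method used in ODIE. The is-a links are assumed from the order of the labels on the slide, and the function names and the "Seminal Vesicle" mention are hypothetical.

```python
# Assumed is-a links (child -> parent), loosely following the slide's labels.
PARENT = {
    "Prostatic adenocarcinoma": "Malignant Prostate Neoplasm",
    "Malignant Prostate Neoplasm": "Prostate Neoplasm",
    "Prostate Neoplasm": "Reproductive System Neoplasm",
    "Reproductive System Neoplasm": "Neoplasm",
}

def ancestors(concept, parent=PARENT):
    """Walk the is-a links upward and return all ancestors of a concept."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def may_corefer(concept_a, concept_b):
    """Two concepts are compatible if equal or related by ancestry."""
    return (
        concept_a == concept_b
        or concept_a in ancestors(concept_b)
        or concept_b in ancestors(concept_a)
    )

def link_mentions(mentions):
    """Link each mention to the nearest earlier compatible mention."""
    links = []
    for i, (text_i, concept_i) in enumerate(mentions):
        for j in range(i - 1, -1, -1):
            if may_corefer(concept_i, mentions[j][1]):
                links.append((mentions[j][0], text_i))
                break
    return links

# Mentions in document order: (surface text, NER concept).
mentions = [
    ("prostatic adenocarcinoma", "Prostatic adenocarcinoma"),
    ("seminal vesicle", "Seminal Vesicle"),
    ("the neoplasm", "Neoplasm"),
]
links = link_mentions(mentions)
print(links)
```

Here "the neoplasm" is linked back to "prostatic adenocarcinoma" because Neoplasm is an ancestor of that concept in the assumed hierarchy.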

Slide # 16/35

Discourse Reasoning

User has

• a set of clinical documents with NER and CR annotations

• a set of information models about those documents

User wants to

• determine which information model (or which parts of one) should be used for which clinical document

Slide # 17/35

Discourse Reasoning

BRAIN, RIGHT PARIETAL, STEREOTACTIC BIOPSY:

Mucinous Adenocarcinoma, consistent with previous history of colon primary

BRAIN

Site

Morphology

COLON

Location

Grade

Size

TNM Stage

Slide # 18/35

Attribute Value Extraction

User has

• clinical documents with NER, CR, DR annotations

• an information model of a specific subset of documents

User wants to

• extract attributes and values from clinical text conforming to the model

• analyze data using common tools

• possibly search later for particular cases

Slide # 19/35

Attribute Value Extraction

Histologic Type

Clark’s Level

Breslow Depth

Mitoses

Ulcer

Perineural Invasion

Angiolymphatic Invasion

Regression

Slide # 20/35

Attribute Value Extraction

Histologic Type – Superficial Spreading

Clark’s Level – IV

Breslow Depth – 1.75 mm

Mitoses – Greater than 2 per HPF

Ulcer – None

Perineural Invasion – None

Angiolymphatic Invasion – None

Regression – None

Slide # 21/35
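As a rough illustration of attribute value extraction against an information model, the sketch below fills a small template from report text using regular expressions. It assumes a hypothetical model with three of the attributes shown above and an invented report sentence; the patterns and the `extract` function are illustrative, not ODIE's approach.

```python
import re

# Hypothetical information model: attribute name -> regex capturing its value.
MODEL = {
    "Clark's Level": r"Clark'?s level\s*[:\-]?\s*(IV|V|I{1,3})\b",
    "Breslow Depth": r"Breslow (?:depth|thickness)\s*[:\-]?\s*(\d+(?:\.\d+)?\s*mm)",
    "Ulcer": r"Ulcer(?:ation)?\s*[:\-]?\s*(present|none|absent)",
}

def extract(text, model=MODEL):
    """Return one value per attribute, or None when no value is found."""
    record = {}
    for attribute, pattern in model.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        record[attribute] = match.group(1) if match else None
    return record

# Invented report text for demonstration.
report = (
    "Malignant melanoma, superficial spreading type. "
    "Clark's level IV, Breslow depth 1.75 mm. Ulceration: none."
)
record = extract(report)
print(record)
```

The resulting dictionary is one row of structured data per report, which is the form that common analysis tools expect.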

Ontology Enrichment

User has

• clinical documents

• an ontology

User wants to identify potential candidate concepts from the documents to include in the ontology

• Visualized in a manner that eases search and recognition of the presence or absence of those concepts in the ontology

• Suggestions for where in the taxonomy the concept should be placed

• Suggestions for relationships

Slide # 22/35

Ontology Enrichment

Breast, Left, Excisional Biopsy:

Mucinous Carcinoma

Breast, Right, Lumpectomy:

Infiltrating Ductal Carcinoma

Breast, Left:

Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy:

Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Disease or Disorder

Breast Disorder

Breast Neoplasm

Malignant Breast Neoplasm

Breast Carcinoma

Ductal Breast Carcinoma

Invasive Ductal Carcinoma

Slide # 23/35

Concept Discovery

Breast, Left, Excisional Biopsy:

Mucinous Carcinoma

Breast, Right, Lumpectomy:

Infiltrating Ductal Carcinoma

Breast, Left:

Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy:

Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Disease or Disorder

Breast Disorder

Breast Neoplasm

Malignant Breast Neoplasm

Breast Carcinoma

Ductal Breast Carcinoma

Invasive Ductal Carcinoma
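Concept discovery on this example can be sketched as a set difference: diagnoses seen in the documents that match neither an ontology label nor a known synonym become candidate concepts. The ontology labels and diagnoses come from the slide; the synonym table and the function name `candidate_concepts` are assumptions made for illustration.

```python
# Ontology branch as listed on the slide.
ONTOLOGY_LABELS = [
    "Disease or Disorder", "Breast Disorder", "Breast Neoplasm",
    "Malignant Breast Neoplasm", "Breast Carcinoma",
    "Ductal Breast Carcinoma", "Invasive Ductal Carcinoma",
]

# Assumed synonym table (hypothetical; a lexical resource might supply this).
SYNONYMS = {"infiltrating ductal carcinoma": "Invasive Ductal Carcinoma"}

def candidate_concepts(diagnoses, labels, synonyms):
    """Return diagnoses that match no ontology label or synonym, in first-seen order."""
    known = {label.lower() for label in labels} | set(synonyms)
    seen = set()
    candidates = []
    for diagnosis in diagnoses:
        key = diagnosis.lower()
        if key not in known and key not in seen:
            seen.add(key)
            candidates.append(diagnosis)
    return candidates

# Diagnosis lines from the example reports.
diagnoses = [
    "Mucinous Carcinoma",
    "Infiltrating Ductal Carcinoma",
    "Invasive Ductal Carcinoma",
    "Malignant Phylloides Tumor",
]
candidates = candidate_concepts(diagnoses, ONTOLOGY_LABELS, SYNONYMS)
print(candidates)
```

In keeping with the human-machine collaboration philosophy, such a list would be shown to an ontology developer for review rather than added automatically.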

Slide # 24/35

Taxonomic Positioning

Breast, Left, Excisional Biopsy:

Mucinous Carcinoma

Breast, Right, Lumpectomy:

Infiltrating Ductal Carcinoma

Breast, Left:

Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy:

Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Disease or Disorder

Breast Disorder

Breast Neoplasm

Malignant Breast Neoplasm

Breast Carcinoma

Ductal Breast Carcinoma

Invasive Ductal Carcinoma

Mucinous Carcinoma

Malignant Phylloides Tumor
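A very simple heuristic for suggesting a taxonomic position is to compare head words: propose as parent the ontology label that shares the candidate's head noun, preferring labels with more shared words and, on ties, shorter (more general) labels. The sketch below is one possible heuristic under stated assumptions, not the technique the project adopted. The head-word synonym table (`tumor` treated as `neoplasm`) and the function names are hypothetical.

```python
# Ontology branch as listed on the slide.
ONTOLOGY_LABELS = [
    "Disease or Disorder", "Breast Disorder", "Breast Neoplasm",
    "Malignant Breast Neoplasm", "Breast Carcinoma",
    "Ductal Breast Carcinoma", "Invasive Ductal Carcinoma",
]

# Assumed word-level synonyms used to align head nouns (hypothetical).
HEAD_SYNONYMS = {"tumor": "neoplasm"}

def normalize(term):
    """Lower-case, split on whitespace, and map words through the synonym table."""
    return [HEAD_SYNONYMS.get(token, token) for token in term.lower().split()]

def suggest_parent(candidate, labels):
    """Suggest the label sharing the candidate's head word, or None if none does."""
    cand_tokens = normalize(candidate)
    best = None
    for label in labels:
        label_tokens = normalize(label)
        if label_tokens[-1] != cand_tokens[-1]:
            continue  # different head word
        # More shared words is better; on ties, prefer the shorter label.
        score = (len(set(cand_tokens) & set(label_tokens)), -len(label_tokens))
        if best is None or score > best[0]:
            best = (score, label)
    return best[1] if best else None

print(suggest_parent("Mucinous Carcinoma", ONTOLOGY_LABELS))
print(suggest_parent("Malignant Phylloides Tumor", ONTOLOGY_LABELS))
```

Under these assumptions the sketch suggests "Breast Carcinoma" and "Malignant Breast Neoplasm" respectively; a curator would still need to confirm or correct each suggestion.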

Slide # 25/35

Relationships

Breast, Left, Excisional Biopsy:

Mucinous Carcinoma

Breast, Right, Lumpectomy:

Infiltrating Ductal Carcinoma

Breast, Left:

Invasive Ductal Carcinoma

Breast, Left, Excisional Biopsy:

Malignant Phylloides Tumor

Tumor shows osseous and lipomatous metaplasia

Disease or Disorder

Breast Disorder

Breast Neoplasm

Malignant Breast Neoplasm

Breast Carcinoma

Ductal Breast Carcinoma

Invasive Ductal Carcinoma

Mucinous Carcinoma

Malignant Phylloides Tumor

has-Finding

Morphologic Finding

Metaplasia

Osseous metaplasia

Lipomatous metaplasia

Cartilaginous metaplasia

Slide # 26/35

First Steps

• Use cases

• Survey of BioPortal, LexBio, GATE and UIMA

• Survey of ontology enrichment techniques

• Architectural assumptions and notional architecture

• Started discussions with Stanford and Mayo

• Delineated first year work

• Annotation software and document sets

Slide # 27/35

Architecture Decisions

The primary goal of ODIE is to serve as a workbench for building and refining text processing pipelines and ontologies.

Information retrieval is not a primary goal. However, ODIE may have a rudimentary search feature for annotated document collections.

• ODIE Toolkit will be a desktop application.

• ODIE UI will be based on the Eclipse Rich Client Platform.

• ODIE will use UIMA as the Language Engineering Platform. GATE processing resources will be usable in ODIE by wrapping them in UIMA Text Analysis Engines (TAEs).

• UIMA is highly configurable using XML descriptor files.

• UIMA has better documentation and community support.

• We will use GATE in the first year for rapid prototyping and manual annotation.

• ODIE will be able to import and use UIMA TAEs developed by others. This may be expanded to GATE processing resources.

• ODIE will allow for packaging a pipeline for deployment in a production environment.
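The idea of a text processing pipeline built from interchangeable components can be sketched in a few lines. The example below is a language-neutral illustration in Python and does not use the UIMA or GATE APIs: `Document`, the two annotators, and `run_pipeline` are hypothetical names. Each annotator reads and extends a shared document object, which is similar in spirit to how analysis engines share a common analysis structure.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Document:
    """Shared object passed through the pipeline: text plus accumulated annotations."""
    text: str
    annotations: List[Tuple[int, int, str]] = field(default_factory=list)

Annotator = Callable[[Document], None]

def token_annotator(doc: Document) -> None:
    """Add a Token annotation for every run of word characters."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append((match.start(), match.end(), "Token"))

def concept_annotator(doc: Document) -> None:
    """Tag tokens found in a tiny hypothetical term list; relies on Token annotations."""
    terms = {"carcinoma": "Concept:Carcinoma"}
    for start, end, kind in list(doc.annotations):
        word = doc.text[start:end].lower()
        if kind == "Token" and word in terms:
            doc.annotations.append((start, end, terms[word]))

def run_pipeline(text: str, annotators: List[Annotator]) -> Document:
    """Run the annotators in order over a fresh document."""
    doc = Document(text)
    for annotator in annotators:
        annotator(doc)
    return doc

doc = run_pipeline("Mucinous carcinoma present", [token_annotator, concept_annotator])
print(doc.annotations)
```

Because the pipeline is just an ordered list of components, swapping, reordering, or importing a component developed elsewhere only changes that list, which is the kind of experimentation a workbench is meant to support.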

Slide # 28/35

Notional Architecture

Slide # 29/35

Synergies: Ontrez

• Information Retrieval

• Range of inputs

Ontrez

ODIE

• Annotation

• Named Entity Recognition

• Enhance annotation of Ontrez?

• Use inference and indexing on clinical documents?

• Other kinds of annotation

• Information Extraction

• Ontology Enrichment

• Clinical Documents

Slide # 30/35

Synergies: Mayo

• NER and Co-reference resolution

• Clustering, discovery of synonyms

• LexGrid

Using similar tools, focused on larger range of document types

More – to be explored

Slide # 31/35

First Year Work

• NER and co-reference modules

• Concept discovery

• Develop manually annotated reference standards for NER and CR

• Focus on testing and developing algorithms

• ODIE 1.0 will include basic architecture and modules for NER, CR and concept discovery, statistics
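Testing algorithms against a manually annotated reference standard is commonly summarized with precision, recall, and F1. The sketch below shows that calculation under the assumption that an annotation counts as correct only when its span and concept both match exactly; the function name and the sample annotations are hypothetical.

```python
def precision_recall_f1(system, reference):
    """Compare system annotations with a reference standard using exact matching."""
    system, reference = set(system), set(reference)
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical annotations: (start, end, concept id).
system = [(0, 8, "C1"), (9, 18, "C2"), (30, 35, "C3")]
reference = [(9, 18, "C2"), (30, 35, "C3"), (40, 45, "C4"), (50, 55, "C5")]

# Two of three system annotations are correct; two of four reference annotations are found.
precision, recall, f1 = precision_recall_f1(system, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```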

Slide # 32/35

Working Together

• Work with Mayo to scope first year collaboration (NER, CR, synonym discovery)

• Decisions regarding terminology access

• Better define what NCBO resources we will use

Slide # 33/35

Working Together

• SourceForge site, ODIE website and Wiki

• All our meetings are open and we are happy to arrange teleconferences

• Mondays 2-4 pm (EST)

Schedule visits with Mayo and Stanford for early spring ’08

Anticipate providing monthly progress updates at the ODIE website starting in January ’08

Other ideas? What’s the expectation of the Council?

Slide # 34/35

Questions?

Comments?

Slide # 35/35