Presentation to NIST (Maryland) 5_01_2012

Using Domain Ontologies to Improve
Information Retrieval in Scientific
Publications
Kincho H. Law, Siddharth Taduri, Gloria
T. Lau
Engineering Informatics Lab at Stanford
University
Motivation
PMID: 12897095
Regional variability in the incidence of end-stage renal disease: an epidemiological approach.
…. Regional variability in the incidence of end-stage renal disease (ESRD) in Austria is reported. Our aim was …. low rates in the state of Tyrol. ….
ESRD incidence data were obtained from …. Between 1995 and 1999, 4811 new cases of ESRD were recorded; …. the state of Tyrol (T) …. incidence of ESRD patients with type 2 diabetes mellitus …. the difference in the overall ESRD incidence …. prevalence of DM, a highly significant correlation was found between ESRD incidence and DM. ….
…. variability in the ESRD incidence in Austria is explained mainly by regional differences in DM-2. Data from similar studies …. allocation for ESRD ….
Synonyms for ESRD
• End Stage Kidney Disease
• Renal Disease, End Stage
• Renal Failure, End Stage
• Kidney Disease, Chronic
• Renal Failure, Chronic
• End-Stage Kidney Disease
• ESRD
• Renal Disease, End-Stage
• Renal Failure, End-Stage
• Chronic Kidney Failure
• Chronic Renal Failure
05/01/2012
Engineering Informatics Lab at
Stanford University
2
Data Set and Knowledge
TREC 2007 Genomics Data Set
• Over 162,000 full-text scientific publications from 49
prominent journals in biomedicine
• Metadata available through MEDLINE
• Tasks involve passage, document, and feature retrieval
• Methodologies are evaluated on their response to 36
topics (‘queries’)
• The topics are categorized based on 13 entity types
(Proteins, Genes, etc.)
Domain Knowledge
• Over 250 biomedical ontologies from BioPortal
XML Representation of Scientific
Publications in PubMed
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID>10022466</PMID>
<DateCreated>
<Year>1999</Year> <Month>02</Month> <Day>25</Day>
</DateCreated>
….
<Article PubModel="Print">
<Journal>
….
<JournalIssue CitedMedium="Print">
<Volume>84</Volume> <Issue>2</Issue>
….
</JournalIssue>
<Title>The Journal of clinical endocrinology and metabolism</Title>
<ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
</Journal>
<ArticleTitle>About the use … of an ACTH 1-39 ….</ArticleTitle>
….
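A record in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is only an illustration: it assumes a complete, well-formed record, so the sample string is a trimmed and closed version of the excerpt above with the elided parts left out. The function name `summarize_citation` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, closed version of the record shown on the slide.
# Elided parts are simply omitted; nothing is filled in.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="MEDLINE">
    <PMID>10022466</PMID>
    <DateCreated>
      <Year>1999</Year> <Month>02</Month> <Day>25</Day>
    </DateCreated>
    <Article PubModel="Print">
      <Journal>
        <JournalIssue CitedMedium="Print">
          <Volume>84</Volume> <Issue>2</Issue>
        </JournalIssue>
        <Title>The Journal of clinical endocrinology and metabolism</Title>
        <ISOAbbreviation>J. Clin. Endocrinol. Metab.</ISOAbbreviation>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""

def summarize_citation(xml_text):
    """Return a few metadata fields from one PubmedArticle record."""
    root = ET.fromstring(xml_text.strip())
    citation = root.find("MedlineCitation")
    return {
        "pmid": citation.findtext("PMID"),
        "year": citation.findtext("DateCreated/Year"),
        "journal": citation.findtext("Article/Journal/Title"),
        "volume": citation.findtext("Article/Journal/JournalIssue/Volume"),
    }

record = summarize_citation(SAMPLE)
print(record["pmid"], record["journal"])
```

Fields such as the journal title and volume are the kind of metadata that can be indexed alongside the full text.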
Domain Knowledge Integration
(1) Annotating Documents prior to indexing
– Response time is fast
– Not flexible: the entire index has to be updated if
a new ontology is added
– Indexes can grow very large
(2) Query Expansion
– Response time is slower
– Very flexible: ontologies can be chosen
dynamically
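The trade-off between the two options can be illustrated with a toy example. This is a sketch under stated assumptions, not the system described in the talk: the ontology dictionary, the documents, and both function names are hypothetical, and matching is plain substring search.

```python
from collections import defaultdict

# Hypothetical toy ontology: concept -> synonyms (not BioPortal data).
ONTOLOGY = {"esrd": {"esrd", "end-stage renal disease", "chronic kidney failure"}}

# Made-up documents, for illustration only.
DOCS = {
    1: "incidence of end-stage renal disease in austria",
    2: "chronic kidney failure and diabetes",
    3: "zebrafish tumor models",
}

def build_annotated_index(docs, ontology):
    """(1) Annotate at indexing time: store concept -> document ids once."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept, synonyms in ontology.items():
            if any(s in text for s in synonyms):
                index[concept].add(doc_id)
    return index

def search_with_expansion(docs, ontology, concept):
    """(2) Expand at query time: scan with whichever ontology is passed in."""
    synonyms = ontology.get(concept, {concept})
    return {d for d, text in docs.items() if any(s in text for s in synonyms)}

# Option (1): fast lookup, but the index must be rebuilt if ONTOLOGY changes.
index = build_annotated_index(DOCS, ONTOLOGY)
print(sorted(index["esrd"]))

# Option (2): more work per query, but the ontology is chosen per call.
print(sorted(search_with_expansion(DOCS, ONTOLOGY, "esrd")))
```

Both calls return the same documents here; the difference is when the ontology is consulted.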
Query Expansion
• The pre-processed query is expanded automatically using BioPortal’s API
[Tumor][MeSH] => {Tumor, Neoplasm, Carcinoma, Leukemia …}
[Diagram: MeSH concept "Tumor" with synonyms (Cancer, Neoplasm, …); concepts shown beneath it: Melanoma, Adenocarcinoma, Leukemia (synonyms: Leucocythaemias, Leucocythemia), Nerve Sheath Neo]
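The expansion step can be sketched as a lookup from a (ontology, term) pair to a synonym list. The dictionary below is a hypothetical local stand-in for an ontology service; it does not reproduce BioPortal's API, and it holds only the terms listed in the example above. The function name `expand_term` is also an assumption.

```python
# Hypothetical stand-in for an ontology lookup; a real system would query
# an ontology service instead of this hard-coded dictionary.
SYNONYMS = {
    ("MeSH", "Tumor"): ["Neoplasm", "Carcinoma", "Leukemia"],
}

def expand_term(term, ontology, synonyms=SYNONYMS):
    """Return the term followed by its synonyms, without duplicates."""
    expanded = [term]
    for candidate in synonyms.get((ontology, term), []):
        if candidate.lower() not in (t.lower() for t in expanded):
            expanded.append(candidate)
    return expanded

print(expand_term("Tumor", "MeSH"))
```

A term with no entry in the chosen ontology comes back unexpanded, which corresponds to searching with the original term only.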
Choosing Domain Knowledge
• The use of synonymy results in inconsistent
performance (2007 TREC genomics track)
• Common reasons include:
– Relevant terms may not be classified as expected
– Some relevant terms may not be classified in a
particular ontology
– Incomplete information (such as synonyms)
• Selection of the appropriate domain ontology is
important
Enriching Existing Ontologies
• Existing ontologies can be enriched to complete some missing
information
Ontology: NDF
Concept: Pamidronate
Synonyms from NDF: APD, Amidronate, ...
Synonyms from MeSH: pamidronate calcium, pamidronate monosodium, aredia
Synonyms from NCI: Pamidronic acid, pamidronate disodium, …
• Multiple ontologies can be used to provide different
classifications
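One simple way to picture enrichment is as a union of synonym lists across ontologies. The sketch below uses only the Pamidronate synonyms named on this slide (the elided entries are left out); the function name `enrich` and the case-insensitive de-duplication rule are assumptions made for illustration.

```python
# Synonym lists copied from the slide; elided entries are left out.
SOURCES = {
    "NDF": ["APD", "Amidronate"],
    "MeSH": ["pamidronate calcium", "pamidronate monosodium", "aredia"],
    "NCI": ["Pamidronic acid", "pamidronate disodium"],
}

def enrich(base, others, sources=SOURCES):
    """Union the base ontology's synonyms with those of other ontologies."""
    merged, seen = [], set()
    for name in [base, *others]:
        for syn in sources.get(name, []):
            key = syn.lower()
            if key not in seen:          # case-insensitive de-duplication
                seen.add(key)
                merged.append(syn)
    return merged

enriched = enrich("NDF", ["MeSH", "NCI"])
print(len(enriched), enriched)
```

The base ontology's synonyms come first, so the original classification stays recognizable after enrichment.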
Evaluations
• Baseline
• With Query Expansion (Suggested Sources)
• Using Enriched Ontologies
• Multiple Query Expansions per query
Summary of Document MAP scores in 2007 TREC
genomics track
Max: 0.3286
Min: 0.0329
Mean: 0.1862
Median: 0.1897
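For reference, mean average precision (MAP) can be computed as sketched below. This follows the standard textbook definition and is an assumption about the metric; the track's official evaluation script may differ in details. The rankings and relevance judgments are made up for illustration.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query: mean of precision at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Made-up rankings for two topics, for illustration only.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6", "d7"], {"d6"}),               # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))
```

Relevant documents that are never retrieved still count in the denominator, so missing them lowers the score.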
Queries
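Topic 205: What [SIGNS OR SYMPTOMS] of anxiety disorder are related to coronary artery disease?
  Suggested Sources for Terms (TREC): Wikipedia
  Selected Domain Knowledge (Our Methodology): Symptom Ontology
Topic 206: What [TOXICITIES] are associated with zoledronic acid?
  Suggested Sources for Terms (TREC): Wikipedia + Aaron
  Selected Domain Knowledge (Our Methodology): NCI Thesaurus
Topic 207: What [TOXICITIES] are associated with etidronate?
  Suggested Sources for Terms (TREC): Wikipedia + Aaron
  Selected Domain Knowledge (Our Methodology): NCI Thesaurus
Topic 211: What [ANTIBODIES] have been used to detect protein PSD-95?
  Suggested Sources for Terms (TREC): MeSH
  Selected Domain Knowledge (Our Methodology): MeSH
Topic 229: What [SIGNS OR SYMPTOMS] are caused by human parvovirus infection?
  Suggested Sources for Terms (TREC): Wikipedia
  Selected Domain Knowledge (Our Methodology): Symptom Ontology
Topic 231: What [TUMOR TYPES] are found in zebrafish?
  Suggested Sources for Terms (TREC): Aaron
  Selected Domain Knowledge (Our Methodology): MeSH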
Baseline
• Queries are used without modification, e.g.,
– “What [ANTIBODIES] have been used to detect
protein PSD-95?”
– “What [SIGNS OR SYMPTOMS] of anxiety disorder
are related to coronary artery disease?”
• Document MAP: 0.277
Query Expansion
• Original Query: What [TUMOR TYPES] are found
in zebrafish?
• Queries are formulated in ‘AND’ clauses:
“[Tumor][MeSH] AND zebrafish”
=>
(Tumor, Neoplasm, Carcinoma, Leukemia …) AND
zebrafish
• Document MAP: 0.347
Multiple Query Expansion Terms
• Expansion can be performed on multiple terms in
the query
• Example: Coronary Artery Disease => {Coronary
heart disease, coronary disease, CAD, …}
[Tumor][MeSH] AND [zebrafish][MeSH]
=>
(tumor, neoplasm, …) AND (zebrafish, danio rerio, …)
• Document MAP: 0.352
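A query of the form (tumor, neoplasm, …) AND (zebrafish, danio rerio, …) can be read as an AND over groups, where each group is an OR over its expanded terms. The sketch below illustrates that reading with substring matching; the function name `matches` and the document snippets are hypothetical, and the elided terms are omitted.

```python
def matches(text, groups):
    """True if every group (AND) has at least one term (OR) in the text."""
    lowered = text.lower()
    return all(any(term.lower() in lowered for term in group) for group in groups)

# Expanded groups from the slide; the elided entries are omitted.
groups = [
    ["tumor", "neoplasm"],
    ["zebrafish", "danio rerio"],
]

# Made-up document snippets, for illustration only.
docs = {
    "a": "Neoplasm formation in Danio rerio embryos",
    "b": "Tumor suppressor pathways in mice",
    "c": "Zebrafish fin regeneration",
}
hits = [doc_id for doc_id, text in docs.items() if matches(text, groups)]
print(hits)
```

Document "a" mentions neither "tumor" nor "zebrafish" literally, yet it is retrieved because both groups were expanded.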
Enriched Ontology – Current Status
• Marginal improvement over basic enhanced
models
• Document MAP: 0.352 (Marginal improvement
from 0.347)
• Issues:
– Framework for enrichment based on synonymy is
rigid, i.e., relevant terms that are entirely missing in
the ontology are still not included
– Relevant terms that are classified differently are never
included in the search
IR Tool
• Expert knowledge is valuable
• Developed a search tool which automatically
integrates with knowledge sources and searches
documents
• We extend MINOE, a co-occurrence based
visualization tool, originally designed for
exploring marine ecosystems
• User can browse (or search) documents through
ontologies and visualize interactions between
concepts
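As a rough picture of what "co-occurrence based" means, the sketch below counts how many documents mention each pair of concepts. It is not MINOE's actual algorithm, which is not described here; the function name `cooccurrence`, the concept list, and the documents are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(docs, concepts):
    """Count, for each concept pair, the documents that mention both."""
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        present = sorted(c for c in concepts if c in lowered)
        counts.update(combinations(present, 2))
    return counts

# Made-up documents and lowercase concept names, for illustration only.
docs = [
    "ESRD incidence and diabetes in Austria",
    "Diabetes prevalence in Austria",
    "ESRD and diabetes",
]
counts = cooccurrence(docs, ["esrd", "diabetes", "austria"])
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Pairs with higher counts would correspond to stronger associations between concepts in a visualization.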
Snapshots of the Tool
I. Enter Query Terms
II. Domain Knowledge Integration
III. Shows Expanded Query, and
other filters that are added to
the search
TREC Topic 220
• Query: What [PROTEINS] are involved in the
activation or recognition mechanism for
PmrD?
• Domain Knowledge: MeSH
[Chart: Document MAP versus depth of hierarchical expansion to child nodes (Level 1, Level 2, Level 3). Data points not preserved.]
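Expansion to child nodes at a given depth can be sketched as a level-by-level walk down the hierarchy. The hierarchy below is a hypothetical toy tree, not actual MeSH content, and the function name `expand_to_depth` is an assumption. The sketch also assumes a tree; a hierarchy in which a node has several parents would need de-duplication.

```python
# Hypothetical toy hierarchy (parent -> children); not actual MeSH content.
CHILDREN = {
    "Proteins": ["Membrane Proteins", "Enzymes"],
    "Enzymes": ["Kinases"],
    "Kinases": ["Histidine Kinase"],
}

def expand_to_depth(root, depth, children=CHILDREN):
    """Collect the root plus its descendants down to `depth` levels."""
    terms, frontier = [root], [root]
    for _ in range(depth):
        frontier = [c for node in frontier for c in children.get(node, [])]
        terms.extend(frontier)
    return terms

for level in (1, 2, 3):
    print(level, expand_to_depth("Proteins", level))
```

Each extra level adds more specific terms to the query, which can raise recall but may also bring in less relevant documents.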
[Screenshots of the tool; images not preserved. Labels visible in the screenshots: "Changed"; "MeSH Descriptors"; "(>1500 Documents)"; "Stronger Association: ~270 Documents"; "Weaker Association: ~57 Documents"; "CHILD CONCEPTS".]
Retrieving Information Across Multiple
Diverse Information Sources
Patent System
Information sources: Issued Patents and Applications, File Wrappers, Court Cases, Regulations and Laws, Technical Publications
Technology Firms’ Concerns
• Can I get patent protection for my innovation?
• Do I build or do I buy related technologies?
• What are my competitors doing?
• How strong are their patents?
• Am I perhaps infringing on someone else’s patents?
• If so, are those patents valid?
• Have they been enforced in court?
• Has their validity been challenged in court?
Cross-Referencing between Information Sources
REGULATIONS:
U.S. Code Title 35, C.F.R. Title 37, M.P.E.P. …
COURT CASE
314 F.3d 1313 (2003)
AMGEN INC., Plaintiff-Cross Appellant v. HOECHST MARION ROUSSEL, INC. (now known as Aventis Pharmaceuticals, Inc.) and Transkaryotic Therapies, Inc., Defendants-Appellants.
…
Plaintiff-Cross Appellant Amgen Inc. is the owner of numerous patents directed to the production of erythropoietin ("EPO"), …alleging that TKT's Investigational New Drug Application ("INDA") infringed United States Patent Nos. 5,547,933; 5,618,698; and 5,621,080. The complaint was amended in October 1999 to include United States Patent Nos. 5,756,349 and 5,955,422, which issued after suit was filed.
PATENT
United States Patent 5,955,422
September 21, 1999
Production of erythropoietin
Inventors: Lin; Fu-Kuen (Thousand Oaks, CA)
Assignee: Kirin-Amgen, Inc. (Thousand Oaks, CA)
Appl. No.: 08/100,197
Filed: August 2, 1993
Abstract: Disclosed are novel polypeptides possessing part or all of the primary structural conformation and one or more of the biological properties of mammalian erythropoietin ("EPO") …
FILE WRAPPER
U.S. Patent 5,955,422
…
Claims 61-63 are rejected under 35 U.S.C. § 103 as being unpatentable over any one of Miyake et al., 1977 (R) …
In accordance with the provisions of 37 C.F.R. §1.607, the present continuation is being filed for the purpose of …
BIOPORTAL: DOMAIN KNOWLEDGE
Publication Database
Solution: Patent System Ontology
Patent System Ontology
I. Facilitate information integration across multiple diverse information sources
• This requires a standardized representation (a formal semantic model): Patent System Ontology
II. Integrate domain semantics into existing information retrieval and text mining methodologies to improve retrieval of information
Information Retrieval Framework
Patent System Ontology
Future Work
• Using multiple enriched ontologies may
provide the necessary terms
• MeSH Descriptors are provided for every
publication during indexing and can
potentially improve results
• Implement Okapi model for scoring
documents
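For orientation, the Okapi BM25 scoring function can be sketched as follows. This is a generic illustration, not the planned implementation: the parameter values `k1=1.2` and `b=0.75` are commonly used defaults chosen here as assumptions, the IDF uses a smoothed variant that stays non-negative, and the tokenized documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up tokenized documents, for illustration only.
docs = [
    "tumor growth in zebrafish".split(),
    "zebrafish fin regeneration".split(),
    "tumor suppressor genes in mice".split(),
]
scores = bm25_scores(["tumor", "zebrafish"], docs)
print([round(s, 3) for s in scores])
```

The first document contains both query terms and therefore receives the highest score of the three.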
Thank You
Backup Slides
Motivation
• Scientific literature is an important source of
information
• Retrieving relevant information from scientific
publications is challenging
• Domain terminology is used inconsistently in
scientific publications
• Increasing amounts of information amplify the
problem
• Improved methodologies based on semantics are
required
Background
• Text REtrieval Conference (TREC) organized by
NIST has showcased many successful methods
• The Genomics track focused on full-text scientific
publications from 49 prominent journals
• Methodologies involved:
– Use of synonymy from ontologies
– Language-based models
– Query expansion and annotations
– Okapi scoring model
Goals
• Understand how domain ontologies can be
leveraged
• Understand which domain ontologies can be
leveraged
• Develop a knowledge-based approach to
integrate domain knowledge with search
mechanism
Query Expansion
• TREC Queries are first manually pre-processed
“What [TUMOR TYPES] are found in zebrafish?”
=>
“[Tumor][MeSH] AND zebrafish”
• [Tumor] indicates the term that has to be expanded
• [MeSH] indicates the ontology that should be used
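The pre-processed notation is simple enough to parse with a regular expression. The sketch below is one possible reading of the syntax: it assumes clauses are separated by the keyword AND and that a tagged clause has exactly the form [term][ontology]. The function name `parse_query` is hypothetical.

```python
import re

# A tagged clause: [term][ontology]
TAGGED = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")

def parse_query(query):
    """Split on AND; return (term, ontology) pairs, ontology None if untagged."""
    clauses = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        m = TAGGED.fullmatch(part)
        clauses.append((m.group(1), m.group(2)) if m else (part, None))
    return clauses

print(parse_query("[Tumor][MeSH] AND zebrafish"))
```

Clauses without an ontology tag are passed through unchanged, so they are searched as plain terms.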
Summary
• Search methodologies must be based on
semantics in order to tackle terminology
inconsistency
• Domain ontologies provide these semantics
• Domain ontologies need to be modified (or
enriched) in order to fulfill information needs
• User interaction is important
BioPortal
• BioPortal is an integrated resource for
biomedical ontologies
• Currently indexes over 300 ontologies
including Medical Subject Headings and Gene
Ontology
• Provides a comprehensive web service,
abstracting the formats and APIs of all
underlying ontologies