Schema-Driven Relationship Extraction from Unstructured Text

Cartic Ramakrishnan, Krys Kochut and
Amit Sheth
LSDIS Lab, University of Georgia, Athens, GA
November 7th 2006
ISWC 2006
1
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
2
Anecdotal Example
UNDISCOVERED PUBLIC KNOWLEDGE
Discovering connections hidden in text
[Figure: graph connecting Nicolas Flamel, Harry Potter, Nicolas Poussin, The Hunchback of Notre Dame, Victor Hugo, Et in Arcadia Ego, Holy Blood, Holy Grail, Priory of Sion, The Da Vinci Code, Leonardo Da Vinci, The Louvre, The Mona Lisa, The Last Supper, The Vitruvian Man and Santa Maria delle Grazie through the relationships mentioned_in, member_of, painted_by, written_by, cryptic_motto_of and displayed_at]
3
Motivation 1 – Undiscovered Public Knowledge in Biology
[Figure: Swanson's discoveries. Migraine and Magnesium, joined by a "?", are connected in PubMed through Stress, Calcium Channel Blockers and Spreading Cortical Depression]
• These associations were discovered in 1986
• The associations were discovered through keyword searches followed by manual analysis of the text to establish possible relevant relationships
4
Motivation 2 – Hypothesis-Driven Retrieval of Scientific Literature
[Figure: a complex query, drawn as a graph over Migraine, Magnesium, Stress, Patient and Calcium Channel Blockers with the relationships affects, inhibit and isa, is posed against PubMed, and the supporting document sets are retrieved]
Keyword query: Migraine[MH] + Magnesium[MH]
5
Motivation 3 – Growth Rate of Public Knowledge
• Data captured per year = 1 exabyte (10^18 bytes) (Eric Neumann, Science, 2005)
• How much is that?
– Compare it to the estimate of the total words ever spoken by humans = 12 exabytes
• A small but significant portion is text data
– PubMed – 16 million abstracts
– MedlinePlus – health information
– OMIM – catalog of human genes and genetic disorders
• Undiscovered public knowledge may also have increased by a large amount
6
Our Past Work in Connection Discovery
• Semantic Associations over RDF graphs
– Discovery and Ranking
• Assumption: rich semantic metadata containing entities related by a diverse set of relationships
• It is therefore critical to bridge the gap between unstructured and structured data by extracting entities and the relationships between them, resulting in semantic metadata
[Figure: semantically connected entities Migraine, Magnesium, Stress, Patient and Calcium Channel Blockers, linked by the relationships affects, inhibit and isa]
7
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
8
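Problem – Extracting Relationships between MeSH Terms from PubMed
[Figure: in the UMLS Semantic Network, the classes Biologically Active Substance, Lipid and Disease or Syndrome are connected by relationships such as affects, causes and complicates. The MeSH terms Fish Oils and Raynaud's Disease are instances (instance_of) of UMLS classes, and the relationship between the two terms is unknown ("???????"). PubMed document counts shown in the figure: 9284, 5 and 4733]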
9
Background Knowledge Used
• UMLS – a high-level schema of the biomedical domain
– 136 classes and 49 relationships
– Synonyms of all relationships – using variant lookup (tools from NLM)
– 49 relationships + their synonyms = ~350, mostly verbs
– Example variants listed for T147: effect, induce, etiology, cause, effecting, induced
• MeSH
– 22,000+ topics organized as a forest of 16 trees
– Used to query PubMed
• PubMed
– Over 16 million abstracts
– Abstracts annotated with one or more MeSH terms
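A minimal sketch of how a word found in a sentence might be normalized to a schema relationship through the variant table. The six variants are the ones shown on the slide; treating T147 as the identifier of a single relationship, and the names `VARIANTS` and `relationship_for`, are assumptions made for illustration.

```python
# Variant table: surface form -> relationship identifier.
# Only the six variants shown on the slide are included; the full table
# (about 350 entries) would come from the NLM variant lookup tools.
VARIANTS = {
    "effect": "T147",
    "induce": "T147",
    "etiology": "T147",
    "cause": "T147",
    "effecting": "T147",
    "induced": "T147",
}


def relationship_for(word):
    """Return the relationship identifier for a word, or None if unknown."""
    return VARIANTS.get(word.lower())
```

A lookup such as `relationship_for("Induced")` yields the identifier, while a word outside the table yields `None` and would be ignored by the extractor.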
10
Method – Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S
  (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) (JJ exogenous) ) (NN stimulation) )
      (PP (IN by) (NP (NN estrogen) ) ) )
  (VP (VBZ induces)
      (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of) (NP (DT the) (NN endometrium) ) ) ) ) ) )
• Entities (MeSH terms) in sentences occur in modified forms
– “adenomatous” modifies “hyperplasia”
– “An excessive endogenous or exogenous stimulation” modifies “estrogen”
• Entities can also occur as composites of 2 or more other entities
– “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
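The bracketed parser output on this slide can be read into a tree with a few lines of code. The following is a minimal sketch, not the tooling used in the talk; the function names `parse_bracketed` and `leaves` are hypothetical, and the nested-list representation is an assumption chosen for brevity.

```python
def parse_bracketed(text):
    """Parse a Penn Treebank style bracketed string into nested lists.

    Each node becomes [label, child, child, ...]; a leaf word is a plain str.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]              # the label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())  # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                          # consume ")"
        return node

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


# The example sentence from the slide, as emitted by the parser.
SENTENCE_TREE = (
    "(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous) (CC or) "
    "(JJ exogenous)) (NN stimulation)) (PP (IN by) (NP (NN estrogen)))) "
    "(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia)) "
    "(PP (IN of) (NP (DT the) (NN endometrium)))))))"
)

tree = parse_bracketed(SENTENCE_TREE)
print(" ".join(leaves(tree)))
```

Reading the leaves back recovers the original sentence, which is a quick check that the bracket structure was read correctly.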
11
Method – Identify Entities and Relationships in Parse Tree
[Figure: parse tree of the sentence "An excessive endogenous or exogenous stimulation by estrogen induces adenomatous hyperplasia of the endometrium", with the subtrees for modifiers, modified entities and composite entities marked]
12
Entities – The Simple, the Modified and the Composite
• To capture the various types of entities we define
– Simple entities as MeSH terms
– Modifiers as siblings of entities that are
• Determiners – “Y induces no X”
• Noun phrases – “An excessive endogenous or exogenous stimulation”
• Adjective phrases – “adenomatous”
• Prepositional phrases – “M is induced by the X in the Z”
– Modified entities as any entity that has a sibling which is a modifier
– Composite entities as any entity that has another entity as a sibling
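The definitions above can be sketched as a rule over the children of a single parse-tree node. This is an illustrative reading, not the authors' implementation: the `Child` record, the `classify` function and the exact label set (including the plain `JJ` tag for single adjectives) are assumptions, and each child is assumed to already carry a flag saying whether it is, or contains, an entity.

```python
from dataclasses import dataclass

# Constituent labels treated as possible modifiers: determiners, noun
# phrases, adjective phrases (plus single adjectives) and prepositional
# phrases. The label set is an assumption for this sketch.
MODIFIER_LABELS = {"DT", "NP", "ADJP", "JJ", "PP"}


@dataclass
class Child:
    label: str       # constituent or part-of-speech label
    text: str        # words covered by this child
    is_entity: bool  # True if the child is, or contains, an entity


def classify(children):
    """Classify the node whose children are given, by the sibling rules."""
    entities = [c for c in children if c.is_entity]
    modifiers = [
        c for c in children
        if not c.is_entity and c.label in MODIFIER_LABELS
    ]
    if len(entities) >= 2:
        return "composite"   # an entity with another entity as a sibling
    if len(entities) == 1 and modifiers:
        return "modified"    # an entity with a modifier as a sibling
    if len(entities) == 1:
        return "simple"
    return "none"
```

For example, the children `JJ adenomatous` and `NN hyperplasia` give a modified entity, while the children covering "adenomatous hyperplasia" and "of the endometrium" give a composite entity.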
13
Resulting RDF
[Figure: RDF graph for the example sentence. Nodes: modified_entity1, modified_entity2, composite_entity1, "An excessive endogenous or exogenous stimulation", estrogen, adenomatous, hyperplasia and endometrium. Properties: hasModifier, hasPart and induces. Legend: modifiers, modified entities, composite entities]
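One plausible reading of this figure as a set of triples, written as plain Python tuples. The node and property names come from the slide, but the slide text does not preserve which node each edge attaches to, so the edge assignment below is an assumption chosen to agree with the entity definitions given earlier in the talk. The helper names are hypothetical.

```python
# (subject, property, object) triples; edge assignment is an assumption.
TRIPLES = [
    ("modified_entity1", "hasModifier",
     "An excessive endogenous or exogenous stimulation"),
    ("modified_entity1", "hasPart", "estrogen"),
    ("modified_entity2", "hasModifier", "adenomatous"),
    ("modified_entity2", "hasPart", "hyperplasia"),
    ("composite_entity1", "hasPart", "modified_entity2"),
    ("composite_entity1", "hasPart", "endometrium"),
    ("modified_entity1", "induces", "composite_entity1"),
]

STRUCTURAL = {"hasPart", "hasModifier"}


def objects(subject, prop, triples=TRIPLES):
    """Return all objects of the given subject and property, in order."""
    return [o for s, p, o in triples if s == subject and p == prop]


def named_relationships(triples=TRIPLES):
    """Return the properties that are not structural (hasPart/hasModifier)."""
    return sorted({p for _, p, _ in triples if p not in STRUCTURAL})
```

Under this reading the only named relationship in the sentence is `induces`, connecting the modified entity built around estrogen to the composite entity built from the hyperplasia and the endometrium.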
14
Outline
• Motivation
• Approach
• Results
• Future Work
15
Results
• Dataset 1
– Swanson’s discoveries
• Associations between Migraine and Magnesium [Hearst99]
– stress is associated with migraines
– stress can lead to loss of magnesium
– calcium channel blockers prevent some migraines
– magnesium is a natural calcium channel blocker
– spreading cortical depression (SCD) is implicated in some migraines
– high levels of magnesium inhibit SCD
– migraine patients have high platelet aggregability
– magnesium can suppress platelet aggregability
16
Results – Creation of Dataset 1
• Keyword pairs (e.g. stress + migraine) run against PubMed return the PubMed abstracts that are annotated (by NLM) with both terms
• 8 pairs of terms in this scenario result in 8 subsets of PubMed
• Semantic Metadata
– Represented in RDF
– With complex entities and relationships connecting them
– Pointers to original document and sentence
– Size
• ~2MB RDF for the Migraine-Magnesium subset of PubMed
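A keyword-pair query of the kind described on this slide could be issued through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and does not contact the server. Combining the two terms with `AND`, the `retmax` value, and the function names are assumptions; the term spellings follow the slides and may differ from the official MeSH headings.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def mesh_pair_query(term_a, term_b):
    """Build a query for abstracts annotated with both MeSH terms."""
    return f"{term_a}[MH] AND {term_b}[MH]"


def esearch_url(term_a, term_b, retmax=100):
    """Return the esearch URL for a pair of MeSH terms (no request is sent)."""
    params = {
        "db": "pubmed",
        "term": mesh_pair_query(term_a, term_b),
        "retmax": retmax,   # maximum number of PubMed IDs to return
    }
    return f"{ESEARCH}?{urlencode(params)}"
```

Calling `esearch_url` once per term pair would give one result set per pair, matching the 8 subsets described above.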
17
Evaluating the Result of Extraction
• Ideal method to evaluate the extraction method
– Domain experts read a set of abstracts, given a set of relationship names and entities to look for
– In addition, they are given the extracted triples and entities
– For every abstract, the expert judge counts the correct, incorrect and missed triples
– Measure precision and recall
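The expert counts described above translate into precision and recall in the usual way: precision is the share of extracted triples that are correct, and recall is the share of triples the expert expects that were actually extracted. A minimal sketch, with a hypothetical function name:

```python
def precision_recall(correct, incorrect, missed):
    """Compute precision and recall from an expert's per-abstract counts.

    correct   -- extracted triples the expert judged correct
    incorrect -- extracted triples the expert judged wrong
    missed    -- triples the expert found that were not extracted
    """
    extracted = correct + incorrect   # everything the system produced
    expected = correct + missed       # everything the expert says is there
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall
```

With made-up counts (not results from the talk) of 8 correct, 2 incorrect and 2 missed triples, both precision and recall come out at 0.8.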
18
Evaluating the Result of Extraction
• In the absence of a domain expert, we focus on getting a feel for the utility of the extracted data
– We know the associations manually discovered between Migraine and Magnesium
– We locate paths of various lengths between them and manually inspect these paths
– If the paths are indicative of the manually discovered associations, the extracted data is useful
19
Paths between Migraine and Magnesium
Paths are considered interesting if they contain one or more named relationships other than hasPart or hasModifier
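The path search and the interestingness test can be sketched over a list of triples. This is an illustrative depth-limited search, not the authors' algorithm: edges are followed in both directions, the function names are hypothetical, and the toy graph uses made-up node identifiers (`me_1`, `me_2`, `ce_1`).

```python
from collections import defaultdict

STRUCTURAL = {"hasPart", "hasModifier"}


def build_adjacency(triples):
    """Index triples so that edges can be followed in both directions."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
        adj[o].append((p, s))
    return adj


def find_paths(triples, start, goal, max_edges):
    """Return all simple paths from start to goal with at most max_edges."""
    adj = build_adjacency(triples)
    paths = []

    def walk(node, visited, edges):
        if node == goal:
            paths.append(list(edges))
            return
        if len(edges) == max_edges:
            return
        for pred, nxt in adj[node]:
            if nxt not in visited:
                visited.add(nxt)
                edges.append((node, pred, nxt))
                walk(nxt, visited, edges)
                edges.pop()
                visited.remove(nxt)

    walk(start, {start}, [])
    return paths


def is_interesting(path):
    """A path is interesting if it has a non-structural relationship."""
    return any(pred not in STRUCTURAL for _, pred, _ in path)


# Toy graph with hypothetical node identifiers.
TOY = [
    ("me_1", "hasPart", "migraine"),
    ("me_1", "caused_by", "me_2"),
    ("me_2", "hasPart", "magnesium"),
    ("ce_1", "hasPart", "migraine"),
    ("ce_1", "hasPart", "magnesium"),
]

all_paths = find_paths(TOY, "migraine", "magnesium", max_edges=3)
interesting = [p for p in all_paths if is_interesting(p)]
```

In the toy graph, the path through `ce_1` uses only `hasPart` edges and is filtered out, while the path through `caused_by` is kept for manual inspection.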
20
An example of such a path
[Figure: a path from migraine (D008881) to magnesium (D008274) that passes through platelet (D001792), collagen (D003094) and the extracted nodes me_2286 and me_3142, using the relationships hasPart, stimulated and caused_by. Text fragments shown with the extracted nodes: "_13%_and_17%_adp_and_collagen_induced_platelet_aggregation" and "by_a_primary_abnormality_of_platelet_behavior"]
21
Results
• Dataset 2
– Neoplasm (C04)
• For the subtree of MeSH rooted at Neoplasms, all topics under this subtree are used as query terms against PubMed
• The resulting dataset contains ~500,000 PubMed abstracts
• The extraction process run on this data returns ~150MB
• Processing the tagged and parsed sentences for Dataset 2 (Neoplasm) to generate RDF took approx. 5 minutes
• Stats
– 211 different named relationships found
– 500,000 instance-property-instance statements
– 260,000 instance-property-literal statements
• Currently setting up to extract RDF from all of PubMed
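Selecting every topic under a MeSH subtree can be done by prefix matching on MeSH tree numbers, which are dot-separated (the Neoplasms subtree is rooted at C04). The sketch below assumes the descriptors are already available as a mapping from topic name to tree numbers; the function names and the sample entries other than Neoplasms are hypothetical.

```python
def in_subtree(tree_number, root):
    """True if tree_number is the root itself or lies beneath it."""
    return tree_number == root or tree_number.startswith(root + ".")


def topics_under(root, descriptors):
    """Return the topic names that have at least one tree number under root.

    descriptors -- mapping of topic name -> list of MeSH tree numbers
    """
    return sorted(
        name
        for name, numbers in descriptors.items()
        if any(in_subtree(n, root) for n in numbers)
    )


# Hypothetical sample; only the Neoplasms root (C04) comes from the slide.
SAMPLE = {
    "Neoplasms": ["C04"],
    "hypothetical topic A": ["C04.000.111"],
    "hypothetical topic B": ["C14.000.222"],
}

query_terms = topics_under("C04", SAMPLE)
```

Appending the dot before matching keeps a root such as `C04` from accidentally matching an unrelated number that merely begins with the same characters.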
22
Outline
• Motivation
• Problem Description & Approach
• Results
• Future Work
23
Future Extensions to the Extraction Process
• Short-term goals (1 month)
– MeSH qualifiers (blood pressure, contraindications)
– Curate and release Migraine-Magnesium RDF
• Long-Term goals
– More complex structures
• Conjunctions
• X causes Y to inhibit Z
– Rule-action language to test new extraction rules
– Finding new terms to enrich existing vocabularies
– Perhaps ontology enrichment
24
The Projected Future of Research in Biology
From …
Hypothesis-driven “wet lab” experiments
To …
Data-driven reduction/pruning of the hypothesis space, leading to new insight and possibly discovery
• What challenges does this transition bring?
25
Use of Generated Semantic Metadata
• Semantic browsing of PubMed based on named relationships between MeSH terms
• Path/hypothesis-based document retrieval
• Knowledge discovery from literature
– Corpus-based complex relationship discovery and ranking
– Corpus-based relevant connection subgraph discovery
26
Support such retrieval and discovery operations across multiple data sources
• Extract semantic metadata about entities in all of these databases that might occur in PubMed text
• The resulting metadata will contain relationships between genes (OMIM), diseases (MeSH) and nucleotide anomalies (SNP)
• Hypothesis validation and knowledge discovery in biology
27
THANK YOU!
28