Beyond Genes, Proteins, and Abstracts: A Framework to Capture Scientific Claims

Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
http://www.ils.unc.edu/~cablake
cablake@email.unc.edu
Motivation
• Relentless increase in electronically available text
– Life Sciences
• The NLM added the 17 millionth entry to PubMed in April 2007
• 5,200 journals indexed
• 12,000 new articles each week!
– Chemistry – more than 110,000 articles in 1 year alone
• Consequences:
– Hundreds of thousands of relevant articles
– Implicit connections between literature go unnoticed
Shift from Retrieval to Synthesis
Entity Extraction
• Newspaper genre
– People, places, and organizations
– Message Understanding Conference (MUC)
• Biomedical genre
– Genes and proteins
– Diseases and treatments
– Chemical compounds
– Challenges: BioCreative, GENIA, JNLPBA
Relationship Extraction
• Newspaper genre
– Person moving from one company to another
• Biomedical genre
– Genes and proteins, e.g. binds, inhibits
– ARBITER (Rindflesch, Rajan, & Hunter, 2000)
– Geneways (Rzhetsky et al., 2004)
– relEx (Fundel, Kuffner, & Zimmer, 2007)
– GENIA www-tsujii.is.s.u-tokyo.ac.jp/GENIA
Causal Relationships
• Newspaper genre
– Causal relationships (Khoo, Chan, & Niu, 1998)
• Biomedical genre
– Causes and treats (Price & Delcambre, 2005)
– Causal knowledge (Khoo, Chan, & Niu, 2000)
• Universal Grammar
– Causatives (Comrie, 1974, 1981)
– Action verbs (Thomson, 1987)
Claim Definition
• “To assert in the face of possible contradiction”
• Example sentence reporting a claim
– “This study showed that Tamoxifen reduces the breast cancer risk”
• Example Claim Framework
– [Tamoxifen] agent
– [reduces] change
– [breast cancer risk] object
Goals
• Create a Framework that reflects how claims are made in biomedical literature
• The Framework should
– generalize beyond biomedicine
– differentiate between different levels of confidence in the claim
– consider claims made in the full text
• Populate the Framework automatically
The Claim Framework
• Information facets
– concepts
– change
– basis of the claim
• Each information facet may have
– modifiers
– directionality
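The facets listed above can be pictured as a small record type. The following is a minimal Python sketch, not the author's implementation: the class and field names (`Facet`, `Claim`, `modifiers`, `direction`, `filled_facets`) are illustrative assumptions, and the example values come from the Tamoxifen sentence on the Claim Definition slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Facet:
    """One information facet; modifiers and direction are optional."""
    text: str
    modifiers: List[str] = field(default_factory=list)
    direction: Optional[str] = None


@dataclass
class Claim:
    """Hypothetical container for the facets of a single claim."""
    category: str                  # e.g. "explicit", "implicit", "correlation"
    agent: Optional[Facet] = None  # first concept
    obj: Optional[Facet] = None    # second concept (the "object" facet)
    change: Optional[Facet] = None
    basis: Optional[Facet] = None  # basis of the claim

    def filled_facets(self) -> List[str]:
        """Names of the facets that are populated for this claim."""
        names = ["agent", "obj", "change", "basis"]
        return [n for n in names if getattr(self, n) is not None]


# "This study showed that Tamoxifen reduces the breast cancer risk"
tamoxifen = Claim(
    category="explicit",
    agent=Facet("Tamoxifen"),
    change=Facet("reduces"),
    obj=Facet("breast cancer risk"),
)
print(tamoxifen.filled_facets())
```

Keeping every facet optional at the type level leaves room for claim categories that omit a facet; which facets a category requires would be enforced separately.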
The Claim Framework
[Table: for each claim category, the facets Agent (Concept A), Object (Concept B), Nature of change, and Claim Basis are marked Required, Optional, or N/A.]
Categories:
1. Explicit Claim
2. Implicit Claim
3. Correlation
4. Comparison
5. Observation
Explicit Claims
Indeed, glycine prevented Wy-14643-stimulated superoxide production by Kupffer cells.
Claim 1
– [glycine] agent
– [prevented] change
– [Wy-14643-stimulated superoxide production] object
Claim 2
– [Kupffer cells] agent
– [produces] change
– [Wy-14643-stimulated superoxide] object
Implicit Claims
In liver the number of peroxisomes increases from about 500-600/cell to > 5000/cell after exposure to peroxisome proliferators.
Claim 1
– [Peroxisome proliferators] agent
– [increases] changeDirection
– [peroxisomes] object
– [In the liver] agentModifier
– [number] agentModifier
Correlations
A weak but statistically significant correlation was observed between the plasma nm23-H1 level and the WBC count (Figure 1, n=102, r=0.437, P<0.0001)
– [plasma nm23-H1 level] agent
– [WBC count] object
– [correlation] change
– [statistically significant] changeModifier
Comparisons
The plasma concentration of nm23-H1 was higher in patients with AML than in normal controls (P = .0001)
Claim 1
– [plasma concentration of nm23-H1] basis of claim
– [Patients with AML] agent
– [higher] changeDirection
– [normal controls] object
Observations
However, the plasma nm23-H1 protein level was increased in AML-M3 patients (P=.0002)
Claim 1
– [nm23-H1 protein level] object
– [increased] changeDirection
– [AML-M3 patients] objectModifier
Working Hypothesis 1
The Claim Framework reflects how a scientist communicates her findings
– Full text documents randomly selected from biomedical literature will report findings using constructs within the Claim Framework
– Human annotators will agree on facets within the Claim Framework
– The Claim Framework will generalize to a variety of scientific literatures
Working Hypothesis 2
Facets within the Claim Framework can be populated automatically
– The system will detect all claims identified by the human annotators (i.e. recall)
– The system will only identify claims that were identified by the human annotators (i.e. precision)
– The system design will generalize to new literatures by avoiding domain specific constructs
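The recall and precision criteria can be made concrete by comparing the set of claims a system emits with the set the annotators identified. The following is a minimal sketch, assuming claims are compared by exact match on hypothetical (agent, change, object) tuples. The slides do not specify a matching criterion. The gold claims reuse the glycine example from the Explicit Claims slide, and the system output is invented for illustration.

```python
from typing import Set, Tuple

# (agent, change, object); an assumed representation, not the author's format
ClaimKey = Tuple[str, str, str]


def precision_recall(system: Set[ClaimKey], gold: Set[ClaimKey]) -> Tuple[float, float]:
    """Return (precision, recall) of system claims against annotator claims."""
    true_positives = len(system & gold)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold = {
    ("glycine", "prevented", "Wy-14643-stimulated superoxide production"),
    ("Kupffer cells", "produces", "Wy-14643-stimulated superoxide"),
}
system = {
    ("glycine", "prevented", "Wy-14643-stimulated superoxide production"),
    ("glycine", "produces", "superoxide"),  # illustrative false positive
}
p, r = precision_recall(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact matching is the strictest choice; a partial-overlap criterion on facet text would give higher scores and would need to be stated alongside the results.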
Validating the Claim Framework
• Draft Claim Framework given to two annotators
• Pilot Study: Identify every claim
– Include claims that don’t conform to the framework
– Don’t consider how this will be automated
Validating the Claim Framework
• Main study
– 25 articles
• Verification
– Random set of sentences annotated twice
– Feedback provided daily
Results
• All documents
– Total number of sentences: 5535
– Sentences with >=1 claim: 1250 (22.6%)
– Total number of claims: 3228
– Average claims per sentence: 2.51
– Claims that did not fit in the Framework: 31
• Per document
– Average number of sentences: 191
– Average number of sentences with >=1 claim: 43
Distribution of Claim Categories
Category       Total     (%)    Pilot     (%)     Main     (%)
Explicit        2489   77.11      332   83.42     2157   76.63
Implicit          87    2.70        3    0.75       84    2.98
Observation      298    9.23       24    6.03      274    9.73
Correlation      174    5.39       12    3.02      162    5.75
Comparison       165    5.11       27    6.85      138    4.90
Total           3228     100      398     100     2830     100
Annotation
All Documents
                     Total     (%)    Words   (Avg)
Agent                 2894   89.65     5221    1.80
Agent Direction        285    8.83      291    1.02
Agent Modifier        1246   38.60     4448    3.57
Object                3197   99.04     6849    2.14
Object Direction       271    8.40      283    1.04
Object Modifier       1561   48.36     5383    3.44
Change                1897   58.77     1953    1.03
Change Direction      1337   41.42     1358    1.02
Change Modifier       1147   35.53     1618    1.41
Claim Basis            165    5.11      394    2.39
Claim Basis Dir.        42    1.30       43    1.02
Claim Basis Mod.        86    2.66      266    3.09
Total                 3228            28107    8.70
Inter Annotator Agreement
Information Facet    Kappa   Agreement
Agent                0.71    substantial
Object               0.77    substantial
Change               0.57    moderate
Change+ChangeDir     0.88    almost perfect
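The agreement labels reported here are consistent with the widely used Landis and Koch (1977) bands for kappa. The slides name neither the scale nor the kappa variant, so both are assumptions in this sketch: it computes Cohen's kappa for two annotators and maps a kappa value to a band. The function names and the toy labels are illustrative.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's label frequencies
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa: float) -> str:
    """Map kappa to the Landis and Koch (1977) agreement bands."""
    if kappa < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy labels for four sentences (invented for illustration)
annotator_1 = ["claim", "claim", "none", "none"]
annotator_2 = ["claim", "none", "none", "none"]
k = cohen_kappa(annotator_1, annotator_2)
print(k, landis_koch(k))

for value in (0.71, 0.77, 0.57, 0.88):
    print(value, landis_koch(value))
```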
Location of Claims
Section        Sentences with claim   Total sentences   % section   % claim
Abstract                         98               309       31.72      7.84
Introduction                    357               979       36.47     28.56
Method                            6              1121        0.54      0.48
Result                          293              1829       16.02     23.44
Discussion                      539              1406       38.34     43.12
Total                          1250              5535       22.58    100.00
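Both percentage columns follow from the counts: "% section" divides the sentences with a claim by all sentences in that section, and "% claim" divides them by the 1250 claim-bearing sentences in the Total row. A short Python check using the figures reported for each section (the variable and function names are illustrative):

```python
# (sentences with a claim, total sentences) per section, as reported
counts = {
    "Abstract": (98, 309),
    "Introduction": (357, 979),
    "Method": (6, 1121),
    "Result": (293, 1829),
    "Discussion": (539, 1406),
}
TOTAL_WITH_CLAIM = 1250  # sentences with a claim, from the Total row


def percent(part: int, whole: int) -> float:
    """Percentage of part in whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


for section, (with_claim, total) in counts.items():
    pct_section = percent(with_claim, total)
    pct_claim = percent(with_claim, TOTAL_WITH_CLAIM)
    print(f"{section:<13}{pct_section:>7.2f}{pct_claim:>7.2f}")
```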
Findings thus far
• 99% of the claims made in these articles could be captured in the Claim Framework
• 22% of sentences report at least 1 claim
• 77% of the claims identified were explicit
• 8% of claims are made in the abstract
• Agreement
– substantial for agents and objects
– almost perfect for change and change direction
Acknowledgements
• This project was supported in part by
– Renaissance Computing Institute (RENCI) Faculty Fellowship Program
– NSF Center for Environmentally Responsible Solvents and Processes (CERSP CHE-9876674)
• This project used resources provided by
– the OSG, which is supported by the NSF & the U.S. Department of Energy's Office of Science
• The speaker thanks
– Nassib Nassar and Mats Rynge (RENCI)
– Amol Bapat and Ryan Jones (SILS)
Questions and Comments Welcome