Catherine Blake (cblake@ics.uci.edu)
Information & Computer Science
University of California, Irvine
Wanda Pratt (wpratt@u.washington.edu)
Information School and Division of Biomedical & Health Informatics
University of Washington
Information overload
– MEDLINE = 11 million citations
– additional 8,000 each week
Specialization of research
– low communication between scientific areas
– little focus on ‘big picture’
• Provide scientists with promising new treatment strategies
• Medical literature has implicit links
• Deductive logic can identify these links
• If A → B and B → C, then A → C
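The transitive inference above can be sketched in a few lines. This is a minimal illustration, assuming each literature has already been reduced to the set of terms it mentions; the function name `implicit_links` and the toy term sets are hypothetical and not taken from the authors' system.

```python
def implicit_links(a_terms, c_terms):
    """Return candidate B-terms: terms found in both the A and C literatures.

    a_terms: terms mentioned in the A literature (e.g. magnesium)
    c_terms: terms mentioned in the C literature (e.g. migraine)
    """
    return a_terms & c_terms


# Toy data, for illustration only.
magnesium_terms = {"platelet activity", "calcium channel blockers", "serotonin", "bone density"}
migraine_terms = {"serotonin", "platelet activity", "calcium channel blockers", "aura"}

shared = sorted(implicit_links(magnesium_terms, migraine_terms))
print(shared)
```

Each shared term is only a candidate link; a person still has to judge whether the A → B and B → C relationships are meaningful.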
[Figure: source literature C (Migraine) linked to target literature A (Magnesium) through B-terms: Platelet Activity, Calcium Channel Blockers, Serotonin, ...]
Swanson and Smalheiser (1997)
Remove ‘redundancies and non-useful terms’
                  Words     Distinct Words
No Pruning        14,051    2,762
Stemmed           13,112    2,492
Manual Pruning              150 - 200
~92-94% of B-terms are manually pruned!
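The pruning percentage can be checked with a short calculation. This sketch assumes the baseline is the 2,492 distinct stemmed words and that 150 to 200 terms survive manual pruning; the variable names are illustrative.

```python
stemmed_distinct = 2492          # distinct words after stemming
kept_low, kept_high = 150, 200   # terms remaining after manual pruning

pruned_high = 1 - kept_low / stemmed_distinct   # share pruned if 150 terms are kept
pruned_low = 1 - kept_high / stemmed_distinct   # share pruned if 200 terms are kept
print(f"{pruned_low:.0%} to {pruned_high:.0%}")
```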
Semantic representation
– Unify synonymous text expressions
– e.g. Serotonin = {5-HT, 5HT, Enteramine,
5-Hydroxytryptamine, 3-(2-Aminoethyl)-
1H-indol-5-ol }
Prune using semantic types
– e.g. Serotonin is a {Organic Chemical,
Pharmacologic Substance, Neuroreactive
Substance or Biogenic Amine}
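Unifying synonymous expressions can be pictured as a lookup from surface strings to one canonical concept. The sketch below assumes a small hand-built table; `SYNONYMS` and `unify` are hypothetical names, and a real system would draw the mapping from a resource such as the UMLS Metathesaurus.

```python
# Hypothetical synonym table: every surface form maps to one canonical concept.
SYNONYMS = {
    "serotonin": "Serotonin",
    "5-ht": "Serotonin",
    "5ht": "Serotonin",
    "enteramine": "Serotonin",
    "5-hydroxytryptamine": "Serotonin",
}


def unify(term):
    """Map a text expression to its canonical concept; unknown terms pass through."""
    return SYNONYMS.get(term.lower(), term)


print({unify(t) for t in ["5-HT", "Enteramine", "serotonin"]})
```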
(1) Metathesaurus
• 311 vocabularies • 776,940 concepts
• ~11 million relationships • 2.10 million strings
(2) Semantic Network
• 134 semantic types • 54 semantic relations
(3) SPECIALIST lexicon
• POS + morphological • 163 899 entries
• 133 945 nouns • 13 179 verbs
• Collect migraine citations
• Generate alternative features
– word
– concept
– semantically pruned concepts
• Evaluate C → B connections
• Domain independent
• Common choice
• Title words (to compare with Swanson)
• Removed
– 417 generic stopwords* e.g. a, and, between, their, really, room, said, think, the, ...
– 31 medical stopwords, e.g. clinical, observed, provide, selection, study, therapy, test, ...
* Source: Sanderson, M. (1999) Available at http://www.dcs.gla.ac.uk/idom/ir_resources
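The word representation can be sketched as tokenizing each title and dropping stopwords. The two sets below hold only the example stopwords shown on the slide, not the full lists of 417 and 31 entries, and `title_words` is a hypothetical helper name.

```python
import re

# Excerpts only: the full lists have 417 generic and 31 medical stopwords.
GENERIC_STOPWORDS = {"a", "and", "between", "their", "really", "room", "said", "think", "the"}
MEDICAL_STOPWORDS = {"clinical", "observed", "provide", "selection", "study", "therapy", "test"}


def title_words(title):
    """Lowercase a title, extract alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [t for t in tokens if t not in GENERIC_STOPWORDS and t not in MEDICAL_STOPWORDS]


print(title_words("The clinical study: magnesium therapy and migraine"))
```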
• Medical specific
• Titles mapped to UMLS concept
• Mapped automatically
(1) partition title sentences into phrases
(2) for each phrase
(2a) direct concept match (UMLS API)
(2b) if not found approx match (UMLS API) select the best concept
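The mapping steps above can be sketched as follows. This is not the UMLS API: `CONCEPTS` is a toy dictionary standing in for the Metathesaurus, the phrase splitter is deliberately naive, and Python's `difflib.get_close_matches` is swapped in for the approximate-match step. All function names are hypothetical.

```python
import difflib
import re

# Hypothetical stand-in for the Metathesaurus: string -> concept name.
CONCEPTS = {
    "migraine": "Migraine",
    "magnesium": "Magnesium",
    "calcium channel blockers": "Calcium Channel Blockers",
}


def split_phrases(title):
    """Step 1 (naive): break a title on punctuation and a few connector words."""
    parts = re.split(r"[,;:]|\b(?:in|of|and|the|for|with)\b", title.lower())
    return [p.strip() for p in parts if p.strip()]


def map_phrase(phrase):
    """Step 2a: direct match. Step 2b: best approximate match, else None."""
    if phrase in CONCEPTS:
        return CONCEPTS[phrase]
    close = difflib.get_close_matches(phrase, list(CONCEPTS), n=1, cutoff=0.8)
    return CONCEPTS[close[0]] if close else None


def map_title(title):
    """Map every phrase in a title and keep the phrases that found a concept."""
    mapped = (map_phrase(p) for p in split_phrases(title))
    return [c for c in mapped if c is not None]


print(map_title("Calcium channel blocker in the treatment of migraine"))
```

Here "calcium channel blocker" has no direct entry and is resolved by the approximate step, while "treatment" finds no concept and is dropped.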
• Used 37 of 134 semantic types in UMLS
• Substance
• Hormone
• Chemical
• Gene or Genome
• Enzyme • Cell
• Amino Acid, Peptide or Protein
• Neuroreactive Substance or Biogenic Amine
• ...
• Goal: generalize semantic types
• not blinded to B-terms
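Semantic pruning as described above amounts to keeping only concepts with at least one semantic type on an allow-list. In this sketch `KEPT_TYPES` is a small subset of the 37 retained types, and the concept-to-type assignments in `CONCEPT_TYPES` are illustrative assumptions rather than actual UMLS data.

```python
# Illustrative subset of the 37 retained semantic types.
KEPT_TYPES = {
    "Substance", "Hormone", "Chemical", "Gene or Genome", "Enzyme", "Cell",
    "Amino Acid, Peptide or Protein",
    "Neuroreactive Substance or Biogenic Amine",
}

# Toy concept -> semantic types table (assumed assignments).
CONCEPT_TYPES = {
    "Serotonin": {"Organic Chemical", "Pharmacologic Substance",
                  "Neuroreactive Substance or Biogenic Amine"},
    "Patients": {"Patient or Disabled Group"},
}


def prune(concepts):
    """Keep concepts that have at least one retained semantic type."""
    return [c for c in concepts if CONCEPT_TYPES.get(c, set()) & KEPT_TYPES]


print(prune(["Serotonin", "Patients", "Unknown"]))
```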
Step 1: Find potentially relevant titles (461)
– any representation + synonyms
– e.g. calcium channel blockers → any word in {calcium, channel, blockers, blocker}
Step 2: Verify each title (366)
– Not all relevant B-terms indicated relevant links
– e.g. "Timolol maleate, a beta blocker, in the treatment of common migraine headache" ≠ calcium channel blocker
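A minimal sketch of Step 1 shows why Step 2 is needed. It assumes titles are matched on single words; `candidate_titles` is a hypothetical name and the second title is invented for illustration.

```python
import re


def candidate_titles(titles, keywords):
    """Step 1: keep any title that shares at least one word with the keyword set."""
    hits = []
    for title in titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        if words & keywords:
            hits.append(title)
    return hits


keywords = {"calcium", "channel", "blockers", "blocker"}
titles = [
    "Timolol maleate, a beta blocker, in the treatment of common migraine headache",
    "Magnesium deficiency and migraine",  # invented example title
]
hits = candidate_titles(titles, keywords)
print(hits)
```

The Timolol title is returned because it contains "blocker", even though it concerns a beta blocker. Word-level matching favors recall, so each hit still has to be verified by hand.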
(1) Precision = (number of relevant B-terms) / (number of B-terms returned)
(2) Recall = (number of relevant B-terms) / (number of relevant titles)
(3) Number of C → B links identified
(4) Feature space dimensionality
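The first two measures translate directly into code. This sketch follows the definitions as stated on the slide; the function names and the counts in the example are hypothetical and are not results from the study.

```python
def precision(relevant_b_terms, b_terms_returned):
    """Relevant B-terms divided by B-terms returned."""
    return relevant_b_terms / b_terms_returned if b_terms_returned else 0.0


def recall(relevant_b_terms, relevant_titles):
    """Relevant B-terms divided by relevant titles, as defined on the slide."""
    return relevant_b_terms / relevant_titles if relevant_titles else 0.0


# Made-up counts, for illustration only.
p = precision(10, 50)
r = recall(10, 200)
print(p, r)
```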
[Figure: interpolated precision versus recall (%) for the Word, Medical Concept, and Semantic Pruning representations]
[Figure: Word, Concept, and Semantic Pruning representations compared at 20 B-terms and at 50 B-terms]
                              Distinct Terms   Per Citation   Abstract
Word                          2732             4.20           76
Concept                       1811             2.10           20
Semantically Pruned Concept   618              0.80           8
• Extend to B → A connections
• Use abstracts
– dimensionality consequences
• Generalize
– Raynaud’s disease and fish oil
– other research questions
• Concept vs Words
• improved precision and recall
• more of the 11 connections in top 50 B-terms
• Semantic Pruning vs Concept
• degraded recall
• improved precision
• more of the 11 connections in top 50 B-terms
http://www.ics.uci.edu/~cblake
• Davis, R (1989). The Creation of New Knowledge by Information Retrieval and Classification. Journal of Documentation 45(4) 273-301.
• Lindsay, R. K. and M. D. Gordon (1999). Literature-Based Discovery by Lexical Statistics. Journal of the American Society for Information Science 50(7): 574-587.
• Sanderson, M. (1999). Stop word list. Available at: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/
• Swanson, D. R. (1988). Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 31: 526-557.
• Swanson, D. R. and N. R. Smalheiser (1997a). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence: 183-203.
• Weeber, M., Klein, H., Mork, J. G., Jong-van den Berg, L., Vos, R. (2000). Text-Based Discovery in Biomedicine: The Architecture of the DAD-system. AMIA.