Multiple Categorization of Search Results


Automatically Identifying Candidate Treatments from Existing Medical Literature

Catherine Blake (cblake@ics.uci.edu)

Information & Computer Science

University of California, Irvine

Wanda Pratt (wpratt@u.washington.edu)

Information School and Division of Biomedical & Health Informatics

University of Washington

Motivation

Information overload

– MEDLINE = 11 million citations

– additional 8,000 each week

Specialization of research

– low communication between scientific areas

– little focus on ‘big picture’

Goal

• Provide scientists with promising new treatment strategies

Assumptions

• Medical literature has implicit links

• Deductive logic can identify these links

If A then B, and if B then C, then A → C
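The deductive step above can be sketched in a few lines of Python. This is a minimal illustration, not the system described in this talk: the function name `candidate_links` and the toy dictionaries are assumptions, and the terms are taken from the migraine and magnesium example used later in the slides.

```python
from collections import defaultdict


def candidate_links(c_to_b, b_to_a):
    """Map each A-term to the B-terms that bridge it to the C literature.

    c_to_b: set of B-terms found in the C (source) literature.
    b_to_a: dict from a B-term to the set of A-terms it co-occurs with.
    """
    bridges = defaultdict(set)
    for b in c_to_b:
        for a in b_to_a.get(b, ()):
            bridges[a].add(b)
    return {a: sorted(bs) for a, bs in bridges.items()}


# Toy data (hypothetical): B-terms found in migraine titles, and the
# A-terms each B-term is linked to in a second literature.
migraine_b_terms = {"platelet activity", "calcium channel blockers", "serotonin"}
b_to_a = {
    "platelet activity": {"magnesium"},
    "calcium channel blockers": {"magnesium"},
    "serotonin": {"magnesium"},
}

links = candidate_links(migraine_b_terms, b_to_a)
print(links)
```

An A-term supported by many distinct B-terms is a stronger candidate than one reached through a single bridge, which is why the sketch keeps the bridging terms rather than a bare yes/no answer.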

Previous Approach

[Diagram: the source literature C (Migraine) connects to the target literature A (Magnesium) through intermediate B-terms: Platelet Activity, Calcium Channel Blockers, Serotonin, ...]

Swanson and Smalheiser (1997)

Current Pruning

Remove ‘redundancies and non-useful terms’

                  Words     Distinct Words
No Pruning        14,051    2,762
Stemmed           13,112    2,492
Manual Pruning              150 - 200

~ 92-94% of B-terms are manually pruned!
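The 92-94% figure can be checked with simple arithmetic. The sketch below assumes the baseline is the 2,492 distinct stemmed words and that 150 to 200 terms survive manual pruning; the slide does not state the baseline explicitly, but these numbers reproduce the quoted range.

```python
# Assumption: the pruning rate is measured against the distinct stemmed words.
distinct_stemmed = 2492
kept_low, kept_high = 150, 200  # terms remaining after manual pruning

pruned_max = 1 - kept_low / distinct_stemmed   # fewest terms kept
pruned_min = 1 - kept_high / distinct_stemmed  # most terms kept

print(f"{pruned_min:.0%} to {pruned_max:.0%} of B-terms pruned")
```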

Our Approach

Semantic representation

– Unify synonymous text expressions

– e.g. Serotonin = {5-HT, 5HT, Enteramine, 5-Hydroxytryptamine, 3-(2-Aminoethyl)-1H-indol-5-ol}

Prune using semantic types

– e.g. Serotonin is an {Organic Chemical, Pharmacologic Substance, Neuroreactive Substance or Biogenic Amine}
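Synonym unification can be illustrated with a small lookup table. This is only a sketch: the `SYNONYMS` dictionary holds the single example from the slide, and the `unify` helper is a hypothetical name, not part of UMLS.

```python
# Synonym set copied from the Serotonin example; a real system would draw
# these from a thesaurus rather than a hand-written dictionary.
SYNONYMS = {
    "serotonin": [
        "5-HT",
        "5HT",
        "Enteramine",
        "5-Hydroxytryptamine",
        "3-(2-Aminoethyl)-1H-indol-5-ol",
    ],
}

# Invert the table so every surface form points at its preferred name.
LOOKUP = {}
for preferred, variants in SYNONYMS.items():
    LOOKUP[preferred.lower()] = preferred
    for variant in variants:
        LOOKUP[variant.lower()] = preferred


def unify(term):
    """Return the preferred name for a term, or the term itself if unknown."""
    return LOOKUP.get(term.strip().lower(), term)


print(unify("5-HT"), unify("enteramine"), unify("Magnesium"))
```

Counting by preferred name means that titles mentioning "5-HT" and titles mentioning "Serotonin" contribute to the same B-term instead of splitting its frequency.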

Unified Medical Language System (UMLS)

(1) Metathesaurus

• 311 vocabularies • 776,940 concepts

• ~11 million relationships • 2.10 million strings

(2) Semantic Network

• 134 semantic types • 54 semantic relations

(3) SPECIALIST lexicon

• POS + morphological • 163 899 entries

• 133 945 nouns • 13 179 verbs

Methodology

• Collect migraine citations

• Generate alternative features

– word

– concept

– semantically pruned concepts

• Evaluate C → B connections

Word Representation

• Domain independent

• Common choice

• Title words (to compare with Swanson)

• Removed

– 417 generic stopwords*, e.g. a, and, between, their, really, room, said, think, the, ...

– 31 medical stopwords, e.g. clinical, observed, provide, selection, study, therapy, test, ...

* Source: Sanderson, M. (1999) Available at http://www.dcs.gla.ac.uk/idom/ir_resources
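The word representation can be sketched as tokenizing each title and dropping stopwords. The two sets below contain only the examples shown on the slide, not the full 417-word and 31-word lists, and `title_words` is a hypothetical helper name.

```python
import re

# Partial lists: only the examples quoted on the slide.
GENERIC_STOPWORDS = {"a", "and", "between", "their", "really", "room",
                     "said", "think", "the"}
MEDICAL_STOPWORDS = {"clinical", "observed", "provide", "selection",
                     "study", "therapy", "test"}
STOPWORDS = GENERIC_STOPWORDS | MEDICAL_STOPWORDS


def title_words(title):
    """Lowercase a title, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(title_words("The clinical study: magnesium and migraine"))
```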

Concept Representation

• Medical specific

• Titles mapped to UMLS concept

• Mapped automatically

(1) partition title sentences into phrases

(2) for each phrase

(2a) direct concept match (UMLS API)

(2b) if not found, approximate match (UMLS API); select the best concept
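The two-stage mapping above (direct match, then approximate match) can be sketched as follows. This does not call the UMLS API: the `CONCEPTS` dictionary is a toy stand-in, the phrase splitter is deliberately crude, and Python's `difflib` is swapped in for the approximate-match step purely for illustration.

```python
import difflib
import re

# Toy stand-in for a concept vocabulary (hypothetical entries).
CONCEPTS = {
    "serotonin": "Serotonin",
    "5-ht": "Serotonin",
    "migraine": "Migraine",
    "magnesium": "Magnesium",
    "calcium channel blockers": "Calcium Channel Blockers",
}


def split_phrases(title):
    """(1) Partition a title into phrases on punctuation and function words."""
    parts = re.split(r"[,;:()]|\b(?:in|of|and|the|with|for)\b", title.lower())
    return [p.strip() for p in parts if p and p.strip()]


def map_phrase(phrase, cutoff=0.8):
    """(2a) Try a direct match; (2b) otherwise take the closest concept name."""
    if phrase in CONCEPTS:
        return CONCEPTS[phrase]
    close = difflib.get_close_matches(phrase, list(CONCEPTS), n=1, cutoff=cutoff)
    return CONCEPTS[close[0]] if close else None


def title_concepts(title):
    """Return the distinct concepts found in a title, in order of appearance."""
    found = []
    for phrase in split_phrases(title):
        concept = map_phrase(phrase)
        if concept is not None and concept not in found:
            found.append(concept)
    return found


print(title_concepts("Calcium channel blocker in the treatment of migraine"))
```

The `cutoff` parameter controls how loose the approximate match may be; a lower value recovers more spelling variants at the cost of more wrong mappings.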

Semantically Pruned Concept

• Used 37 of 134 semantic types in UMLS

• Substance

• Hormone

• Chemical

• Gene or Genome

• Enzyme

• Cell

• Amino Acid, Peptide or Protein

• Neuroreactive Substance or Biogenic Amine

• ...

• Goal: generalize semantic types

• not blinded to B-terms
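Semantic pruning amounts to keeping a concept only if at least one of its semantic types is on the retained list. The sketch below lists only the types named on this slide (the full set of 37 is not given), and the type assigned to Migraine is an illustrative assumption.

```python
# Retained semantic types: only those named on the slide.
KEEP_TYPES = {
    "Substance",
    "Hormone",
    "Chemical",
    "Gene or Genome",
    "Enzyme",
    "Cell",
    "Amino Acid, Peptide or Protein",
    "Neuroreactive Substance or Biogenic Amine",
}

# Concept-to-type table. Serotonin's types come from the earlier slide;
# the Migraine entry is an assumption for illustration.
CONCEPT_TYPES = {
    "Serotonin": {"Organic Chemical", "Pharmacologic Substance",
                  "Neuroreactive Substance or Biogenic Amine"},
    "Migraine": {"Disease or Syndrome"},
}


def prune(concepts):
    """Keep concepts that have at least one retained semantic type."""
    return [c for c in concepts if CONCEPT_TYPES.get(c, set()) & KEEP_TYPES]


print(prune(["Serotonin", "Migraine"]))
```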

Evaluation

Number of Relevant Items

Step 1: Find potentially relevant titles (461 items)

– any representation + synonyms

– e.g. calcium channel blockers → any word in {calcium, channel, blockers, blocker}

Step 2: Verify each title (366 items)

– Not all relevant B-terms indicated relevant links

– e.g. "Timolol maleate, a beta blocker, in the treatment of common migraine headache" ≠ calcium channel blocker

Evaluation - Metrics

(1) Precision = (Number of relevant B-terms) / (Number of B-terms returned)

(2) Recall = (Number of relevant B-terms) / (Number of relevant titles)

(3) Number of C → B links identified

(4) Feature space dimensionality
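Metrics (1) and (2) can be written directly from the definitions on this slide. The function names and the example counts below are hypothetical, apart from the 366 verified relevant titles reported in the evaluation step.

```python
def precision(relevant_returned, returned):
    """Relevant B-terms divided by B-terms returned (0.0 if none returned)."""
    return relevant_returned / returned if returned else 0.0


def recall(relevant_returned, relevant_titles):
    """Relevant B-terms divided by the number of relevant titles."""
    return relevant_returned / relevant_titles if relevant_titles else 0.0


# Hypothetical run: 50 B-terms returned, 10 of them relevant,
# measured against the 366 verified relevant titles.
p = precision(10, 50)
r = recall(10, 366)
print(f"precision={p:.2f} recall={r:.3f}")
```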

Interpolated Precision

[Figure: interpolated precision plotted against recall (%) for the Word, Medical Concept, and Semantic Pruning representations]
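The slides do not define interpolated precision, so the sketch below assumes the usual information-retrieval convention: the interpolated precision at a recall level is the highest precision observed at that recall or any higher recall. The data points are hypothetical, not values read from the figure.

```python
def interpolated_precision(points, recall_levels):
    """points: list of (recall, precision) pairs measured along a ranking.

    For each recall level, return the maximum precision among all points
    whose recall is at least that level (0.0 if there are none).
    """
    result = []
    for level in recall_levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Hypothetical measurements.
points = [(0.05, 0.40), (0.10, 0.30), (0.15, 0.35), (0.20, 0.20)]
levels = [0.05, 0.10, 0.15, 0.20, 0.25]
curve = interpolated_precision(points, levels)
print(curve)
```

Interpolation makes the curve non-increasing, which is what allows curves for different representations to be compared at fixed recall levels.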

Number of Links Identified

[Figure: bar chart of the number of links identified at 20 B-terms and at 50 B-terms for the Word, Concept, and Semantic Pruning representations]

Dimensionality

                               Distinct Terms    Per Citation    Abstract
Word                           2732              4.20            76
Concept                        1811              2.10            20
Semantically Pruned Concept    618               0.80            8

Future Work

• Extend to B → A connections

• Use abstracts

– dimensionality consequences

• Generalize

– Raynaud’s disease and fish oil

– other research questions

Conclusions

• Concept vs Words

• improved precision and recall

• more of the 11 connections in top 50 B-terms

• Semantic Pruning vs Concept

• degraded recall

• improved precision

• more of the 11 connections in top 50 B-terms

http://www.ics.uci.edu/~cblake

Catherine Blake (cblake@ics.uci.edu)

Wanda Pratt (wpratt@u.washington.edu)

References

• Davis, R. (1989). The Creation of New Knowledge by Information Retrieval and Classification. Journal of Documentation 45(4): 273-301.

• Lindsay, R. K. and M. D. Gordon (1999). Literature-Based Discovery by Lexical Statistics. Journal of the American Society for Information Science 50(7): 574-587.

• Sanderson, M. (1999). Stop word list. Available at: http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/

• Swanson, D. R. (1988). Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 31: 526-557.

• Swanson, D. R. and N. R. Smalheiser (1997a). An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence: 183-203.

• Weeber, M., Klein, H., Mork, J. G., Jong-van den Berg, L., Vos, R. (2000). Text-Based Discovery in Biomedicine: The Architecture of the DAD-system. AMIA.
