UIC at TREC 2006: Genomics Track
Wei Zhou, Clement T. Yu
University of Illinois at Chicago
Nov. 16, 2006
3 stages

Stage 1: Conversion
- Greek letters → English words

Stage 2: Paragraph retrieval
- retrieve the 2,000 most relevant paragraphs

Stage 3: Passage extraction and ranking
- extract and retrieve the 1,000 most relevant passages
Stage 1: conversion

Convert the Greek letters into English
words, for example,
TGF β1 → TGF beta1
(β, in the HTML documents, may be
represented by “&#223” or “beta.gif”)
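
The conversion step can be sketched as a small text-normalization function. This is only an illustration under stated assumptions: the letter table, the regular expressions, and the function name `convert_greek` are hypothetical, and only the two encodings named on the slide (“&#223” and “beta.gif”) come from the source.

```python
import re

# Hypothetical, partial letter table; a real system would cover the full alphabet.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}

# Image-based encoding such as <img src="beta.gif"> (assumed tag shape).
IMG = re.compile(
    r'<img[^>]*src="[^"]*?(alpha|beta|gamma|delta|kappa)\.gif"[^>]*>',
    re.IGNORECASE,
)

# Numeric entity named on the slide, with or without the trailing semicolon.
ENTITY_BETA = re.compile(r"&#223(?:;|(?!\d))")


def convert_greek(html):
    """Replace Greek letters and their HTML stand-ins with English words."""
    html = IMG.sub(lambda m: m.group(1).lower(), html)
    html = ENTITY_BETA.sub("beta", html)
    for letter, word in GREEK.items():
        html = html.replace(letter, word)
    return html
```

With this sketch, `convert_greek("TGF β1")`, `convert_greek("TGF &#223;1")`, and `convert_greek('TGF <img src="beta.gif">1')` all yield `TGF beta1`.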
Stage 2: paragraph retrieval

The goal of this stage is to retrieve the 2,000
most relevant paragraphs.

Several techniques are utilized:

1. conditional Porter stemming
2. gene symbol lexical variants handling
3. concept retrieval IR model
4. query expansion
5. abbreviation correction
Stage 2: paragraph retrieval
- conditional Porter stemming

Potential errors of the Porter stemmer

Type 1: gene symbol → non-gene word
e.g., “Pes” → “Pe”, “IDE” → “ID”
Type 2: non-gene word → gene symbol
e.g., “IDEE” → “IDE”

Solution: a table (Entrez gene database)
containing all the gene symbols is
maintained.
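
A minimal sketch of how such a table might guard the stemmer is shown here. The slide does not spell out the exact rule, so the rule used (skip stemming when either the word or its stem is a known gene symbol) is an assumption, as are the tiny `GENE_SYMBOLS` set and the `toy_stem` function, which is a crude stand-in and not the Porter algorithm.

```python
# Hypothetical stand-in for the gene symbol table built from Entrez Gene.
GENE_SYMBOLS = {"PES", "IDE"}


def toy_stem(word):
    """Crude stand-in for a stemmer: drops one trailing 's' or 'e'."""
    if len(word) > 2 and word[-1].lower() in ("s", "e"):
        return word[:-1]
    return word


def conditional_stem(word, stem=toy_stem, gene_symbols=GENE_SYMBOLS):
    """Stem a word unless doing so would create one of the two error types."""
    # Type 1: the word is a gene symbol, so stemming would destroy it.
    if word.upper() in gene_symbols:
        return word
    stemmed = stem(word)
    # Type 2: the stem collides with a gene symbol, so keep the original word.
    if stemmed.upper() in gene_symbols:
        return word
    return stemmed
```

Under these assumptions, “Pes”, “IDE”, and “IDEE” are left unchanged, while an ordinary word such as “genes” is still stemmed.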
Stage 2: paragraph retrieval
- handling lexical variants of gene symbols

2 strategies:

Strategy 1: automatically generate lexical
variants (Buttcher, 2004; Huang, 2005).
e.g., PLA2 → PLA 2, PLAII, and PLA II
Strategy 2: retrieve additional lexical
variants from a term database of MEDLINE
(Zhou, 2006).
e.g., PLA2 → PL-A2
Note: PLA2: Phospholipase A2
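
Strategy 1 can be illustrated with a short generator that splits a symbol at the letter/digit boundary and swaps Arabic for Roman numerals. The rules below are assumptions chosen to reproduce the PLA2 example; they are not taken from the cited systems, and the `ROMAN` table and function name are hypothetical.

```python
import re

# Hypothetical, partial Arabic-to-Roman table for small gene-symbol suffixes.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}


def lexical_variants(symbol):
    """Generate simple variants for symbols of the form letters + digits."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", symbol)
    if match is None:
        return []
    letters, digits = match.groups()
    variants = [f"{letters} {digits}"]  # break at the letter/digit boundary
    roman = ROMAN.get(digits)
    if roman is not None:
        variants.append(f"{letters}{roman}")   # Roman numeral, no space
        variants.append(f"{letters} {roman}")  # Roman numeral, with space
    return variants
```

For `lexical_variants("PLA2")` this returns `["PLA 2", "PLAII", "PLA II"]`. A variant such as PL-A2 does not follow from these rules, which is the gap Strategy 2 fills with a term database.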
Stage 2: paragraph retrieval
- concept retrieval (IR model)
Definition:
A concept is a biomedical meaning or
sense.
1) a gene and its synonym set refer to
the same concept;
2) a MeSH term and its synonym set refer to
the same concept.

Stage 2: paragraph retrieval
- concept retrieval (IR model)
Assumption: Okapi does not work well if the query
contains multiple concepts. For example:
q: “role of gene PRNP in mad cow disease.”
concept 1: PRNP; concept 2: mad cow disease
d1: has many occurrences of concept 2
d2: has a small number of occurrences of both
concepts
Okapi: sim(q,d1) > sim(q,d2), but intuitively d2 is
more relevant than d1.
Stage 2: paragraph retrieval
- concept retrieval (IR model)
According to our model (Liu, 2004; UIC Robust
track, 2005), we have sim(q, d2) > sim(q, d1)
because:
sim_concept(q, d2) > sim_concept(q, d1), although
sim_word(q, d2) < sim_word(q, d1).
sim_concept(q, d) includes both concept 1 & concept 2.
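
One way to read this comparison is as a two-level ranking: documents are ordered by concept similarity first, and word similarity only breaks ties. The sketch below assumes that reading; the slide gives only the resulting inequalities, and both scoring functions are deliberately simplified, hypothetical stand-ins rather than the model's actual formulas.

```python
def concept_similarity(query_concepts, doc_counts):
    """Count the distinct query concepts that occur in the document."""
    return sum(1 for c in query_concepts if doc_counts.get(c, 0) > 0)


def word_similarity(query_concepts, doc_counts):
    """Frequency-based stand-in for a word-level (Okapi-style) score."""
    return sum(doc_counts.get(c, 0) for c in query_concepts)


def rank(query_concepts, docs):
    """Sort document names by (concept similarity, word similarity), best first."""
    return sorted(
        docs,
        key=lambda name: (
            concept_similarity(query_concepts, docs[name]),
            word_similarity(query_concepts, docs[name]),
        ),
        reverse=True,
    )


# Toy occurrence counts mirroring the d1/d2 example (values are made up).
query = ["PRNP", "mad cow disease"]
docs = {
    "d1": {"mad cow disease": 9},
    "d2": {"PRNP": 1, "mad cow disease": 1},
}
ranking = rank(query, docs)
```

Here d1 has the higher word-level score, yet d2 is ranked first because it covers both query concepts.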
Stage 2: paragraph retrieval
- query expansion




Synonyms
Hyponyms (more specific terms)
Pseudo-feedback
Related terms
Stage 2: paragraph retrieval
- query expansion using biomedical knowledge

Related terms (co-occur frequently &
related semantically)
q: “How do interactions between HNF4 and COUP-TF1 suppress liver
function?”
Related terms for “HNF4 and COUP-tf I”:
Hepatocytes
Liver
Hepatoblastoma
Gluconeogenesis
Hepatitis B virus
There exist relationships between the semantic type of a related
term and the semantic type of each query concept in the UMLS
semantic network.
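
The two conditions (frequent co-occurrence and a semantic relationship) can be sketched as a simple filter. Everything in the example is hypothetical: the counts, the threshold `min_count`, and the type labels such as `T_gene` are placeholders, not real UMLS semantic types or corpus statistics.

```python
def related_terms(query_concept, cooccurrence, semantic_type, related_types,
                  min_count=5):
    """Keep candidate terms that co-occur often and are semantically related."""
    query_type = semantic_type[query_concept]
    selected = []
    for term, count in cooccurrence.items():
        if count < min_count:
            continue  # fails the co-occurrence condition
        if (semantic_type.get(term), query_type) in related_types:
            selected.append(term)  # a relation exists between the two types
    return selected


# Made-up toy data, for illustration only.
cooccurrence = {"Hepatocytes": 12, "Liver": 30, "Kidney": 2, "Protocol": 40}
semantic_type = {
    "HNF4": "T_gene",
    "Hepatocytes": "T_cell",
    "Liver": "T_organ",
    "Kidney": "T_organ",
    "Protocol": "T_other",
}
related_types = {("T_cell", "T_gene"), ("T_organ", "T_gene")}
expansion = related_terms("HNF4", cooccurrence, semantic_type, related_types)
```

In this toy setting “Kidney” is dropped for low co-occurrence and “Protocol” for lacking a semantic relation, leaving “Hepatocytes” and “Liver”.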
Stage 2: paragraph retrieval
- avoid incorrect match of abbreviations

Given a query that contains both the abbreviation of a gene
symbol and its full form, a document matches the
term only if both the abbreviation and the full form are
matched. For example,
q: role of APC (adenomatous polyposis coli) in colon cancer?
d: “…Much work has been undertaken in recent decades with the aim of
producing projections of future cancer incidence and mortality rates
from observed rates by using age-period-cohort (APC) models…”
Here d contains “APC” but not “adenomatous polyposis coli”, so it
does not match.
Note that gene symbols are usually abbreviations,
which are very ambiguous in the biomedical literature.
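
The matching rule can be sketched as a check that requires both forms to be present. This is a minimal illustration assuming plain substring matching for the full form and a whole-word, case-sensitive match for the abbreviation; the function name is hypothetical.

```python
import re


def matches_abbreviated_term(document, abbreviation, full_form):
    """Return True only if the document contains both the abbreviation and its full form."""
    has_abbreviation = re.search(
        r"\b" + re.escape(abbreviation) + r"\b", document
    ) is not None
    has_full_form = full_form.lower() in document.lower()
    return has_abbreviation and has_full_form


cohort_doc = (
    "projections of future cancer incidence and mortality rates from "
    "observed rates by using age-period-cohort (APC) models"
)
gene_doc = "mutations in adenomatous polyposis coli (APC) in colon cancer"
```

With these inputs, `matches_abbreviated_term(cohort_doc, "APC", "adenomatous polyposis coli")` is `False`, while the same call on `gene_doc` is `True`.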
Stage 3: passage extraction and ranking

The goal of this stage is to take the
output of stage 2 (i.e., the 2,000 most
relevant paragraphs) and identify the
1,000 most relevant passages (i.e., one
or more consecutive sentences within
paragraphs).
Stage 3: passage extraction and ranking
- extraction

The criterion for the optimal passage in
a paragraph is given by:
“Given various windows of different sizes,
choose the one which has the maximum
number of query concepts and the smallest
size.”
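
The criterion can be sketched as a search over all windows of consecutive sentences in a paragraph. The sketch assumes that window size is measured in sentences and that each sentence has already been mapped to the set of query concepts it mentions; both are assumptions, as is the tie-breaking rule (the earliest window wins).

```python
def best_passage(sentence_concepts, query_concepts):
    """Return (start, end) sentence indices of the optimal window, or None.

    sentence_concepts: list of sets, one set of concepts per sentence.
    """
    query = set(query_concepts)
    best_key = None
    best_span = None
    for start in range(len(sentence_concepts)):
        covered = set()
        for end in range(start, len(sentence_concepts)):
            covered |= sentence_concepts[end] & query
            size = end - start + 1
            # More query concepts first, then the smaller window.
            key = (len(covered), -size)
            if best_key is None or key > best_key:
                best_key = key
                best_span = (start, end)
    return best_span


# Toy paragraph: four sentences annotated with the concepts they contain.
paragraph = [
    {"PRNP"},
    set(),
    {"mad cow disease"},
    {"PRNP", "mad cow disease"},
]
span = best_passage(paragraph, ["PRNP", "mad cow disease"])
```

For the toy paragraph, the last sentence alone covers both concepts, so the selected window is `(3, 3)` rather than the longer window spanning sentences 0 to 2.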
Stage 3: passage extraction and ranking
- ranking

The ranking of passages is similar to
the ranking of paragraphs. For each
passage, we computed its concept
similarity and word similarity with the
query. Then the concept retrieval model
is applied for the ranking.
Experiment results

3 runs:
- UICgen1: the top 1,000 most
relevant paragraphs were returned as
the passages.
- UICgen2: the top 1,000 optimal
passages according to the criterion
were returned (some bugs).
- UICgen3: same as UICgen2, except
the bugs were removed.
Experiment results

Document
Run       MAP      # best   # > Median
UICgen1   0.5439   3        25
UICgen2   0.5268   2        25
UICgen3   0.5320   3        25

Passage
Run       MAP      # best   # > Median
UICgen1   0.0750   0        25
UICgen2   0.1243   0        25
UICgen3   0.1479   7        25

Aspect
Run       MAP      # best   # > Median
UICgen1   0.4411   7        25
UICgen2   0.3478   1        23
UICgen3   0.3492   1        24
References

Buttcher S, Clarke CLA, Cormack GV. Domain-specific synonym
expansion and validation for biomedical information retrieval (MultiText
experiments for TREC 2004). The Thirteenth Text REtrieval Conference
(TREC 2004) Proceedings, 2004, Gaithersburg, MD.
Huang X, Zhong M, Si L. York University at TREC 2005: Genomics Track.
The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings,
2005, Gaithersburg, MD.
Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of
Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.
Liu S, Liu F, Yu C, Meng WY. An Effective Approach to Document
Retrieval via Utilizing WordNet and Recognizing Phrases. Proceedings of
the 27th Annual International ACM SIGIR Conference, pp. 266-272,
Sheffield, UK, July 2004.
Liu S, Yu C. UIC at TREC 2005: Robust Track. The Fourteenth Text
REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.
Questions
Thanks!