ppt - UIC - Computer Science - University of Illinois at Chicago

UIC at TREC 2006: Genomics Track
Wei Zhou, Clement T. Yu
University of Illinois at Chicago
Nov. 16, 2006

3 stages
  - Stage 1: Conversion - Greek letters → English words
  - Stage 2: Paragraph retrieval - retrieve 2,000 most relevant paragraphs
  - Stage 3: Passage extraction and ranking - extract and retrieve 1,000 most relevant passages

Stage 1: conversion
  - Convert the Greek letters into English words, for example, TGF β1 → TGF beta1
    (β, in the HTML documents, may be represented by “&#223” or “beta.gif”)

Stage 2: paragraph retrieval
  - The goal of this stage is to retrieve 2,000 most relevant paragraphs.
  - Several techniques are utilized:
      1. conditional Porter stemming
      2. gene symbol lexical variants handling
      3. concept retrieval IR model
      4. query expansion
      5. abbreviation correction

Stage 2: paragraph retrieval - conditional Porter stemming
  - Potential errors of the Porter stemmer
      Type 1: gene symbol → non-gene word, e.g., “Pes” → “Pe”, “IDE” → “ID”
      Type 2: non-gene word → gene symbol, e.g., “IDEE” → “IDE”
  - Solution: a table (Entrez gene database) containing all the gene symbols is maintained.

Stage 2: paragraph retrieval - handling lexical variants of gene symbols
  - 2 strategies:
      Strategy 1: automatically generate lexical variants (Buttcher, 2004; Huang, 2005).
        e.g., PLA2 → PLA 2, PLAII, and PLA II
      Strategy 2: retrieve additional lexical variants from a term database of MEDLINE (Zhou, 2006).
        e.g., PLA2 → PL-A2
  - Note: PLA2: Phospholipase A2

Stage 2: paragraph retrieval - concept retrieval (IR model)
  Definition: A concept is a biomedical meaning or sense.
    1) a gene and its synonym set refer to the same concept;
    2) a MeSH term and its synonym set refer to the same concept.

Stage 2: paragraph retrieval - concept retrieval (IR model)
  Assumption: Okapi does not work well if the query contains multiple concepts.
  For example:
    q: “role of gene PRNP in mad cow disease.”
       concept 1: PRNP; concept 2: mad cow disease
    d1: has many occurrences of concept 2
    d2: has a small number of occurrences of both concepts
  Okapi: sim(q, d1) > sim(q, d2), but intuitively d2 is more relevant than d1.

Stage 2: paragraph retrieval - concept retrieval (IR model)
  According to our model (Liu, 2004; UIC Robust track, 2005), we have:
    sim(q, d2) > sim(q, d1)
  because:
    sim_concept(q, d2) > sim_concept(q, d1)
  although:
    sim_word(q, d2) < sim_word(q, d1)
  sim_concept(q, d) includes both concept 1 and concept 2.

Stage 2: paragraph retrieval - query expansion
  - Synonyms
  - Hyponyms (more specific terms)
  - Pseudo-feedback
  - Related terms

Stage 2: paragraph retrieval - query expansion using biomedical knowledge
  - Related terms (co-occur frequently and are related semantically)
    q: “How do interactions between HNF4 and COUP-TF1 suppress liver function?”
    related terms (diagram centered on “HNF4 and COUP-tf I”):
      Hepatocytes, Liver, Hepatoblastoma, Gluconeogenesis, Hepatitis B virus
  - There exist relationships between the semantic type of a related term and the semantic type of each query concept in the UMLS semantic network.

Stage 2: paragraph retrieval - avoid incorrect match of abbreviations
  - Given a query that contains both a gene-symbol abbreviation and its full form, a document matches the term only if it contains both the abbreviation and the full form.
    For example,
      q: role of APC (adenomatous polyposis coli) in colon cancer?
      d: “…Much work has been undertaken in recent decades with the aim of producing projections of future cancer incidence and mortality rates from observed rates by using age-period-cohort (APC) models…”
    Here “APC” in d stands for age-period-cohort, not adenomatous polyposis coli, so d should not match.
  - Notice that gene symbols are usually abbreviations, which are very ambiguous in the biomedical literature.

Stage 3: passage extraction and ranking
  - The goal of this stage is to take the output of stage 2 (i.e., 2,000 most relevant paragraphs) and identify the 1,000 most relevant passages (i.e., one or more consecutive sentences within paragraphs).
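
A passage, as defined above, is a window of consecutive sentences inside one paragraph. The following is a minimal sketch of how such a window could be chosen, assuming the selection rule "cover the most query concepts, then prefer the smallest window". It is not the UIC system's code: the function name `best_passage`, the case-insensitive substring matching, and measuring window size in sentences are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def best_passage(sentences: List[str],
                 concepts: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return inclusive (start, end) sentence indices of the chosen window.

    `concepts` maps a concept name to its surface forms (e.g. synonyms).
    Matching is naive case-insensitive substring search (an assumption).
    """
    if not sentences:
        raise ValueError("paragraph has no sentences")

    # For each sentence, the set of query concepts it mentions.
    found = [
        {name for name, forms in concepts.items()
         if any(form.lower() in s.lower() for form in forms)}
        for s in sentences
    ]

    best_key, best_span = None, (0, 0)
    for start in range(len(sentences)):
        covered = set()
        for end in range(start, len(sentences)):
            covered |= found[end]
            # Rank by concepts covered first, then by fewer sentences.
            key = (len(covered), -(end - start + 1))
            if best_key is None or key > best_key:
                best_key, best_span = key, (start, end)
    return best_span


# Hypothetical paragraph and query concepts, for illustration only.
paragraph = [
    "Prion diseases affect many mammals.",
    "PRNP encodes the prion protein.",
    "Its role in mad cow disease has been studied widely.",
    "Further work is needed.",
]
query_concepts = {
    "PRNP": ["PRNP", "prion protein"],
    "mad cow disease": ["mad cow disease",
                        "bovine spongiform encephalopathy"],
}
print(best_passage(paragraph, query_concepts))  # (1, 2)
```

Enumerating every window is quadratic in the number of sentences, which is acceptable for paragraph-sized inputs. Ties between windows of equal coverage and size go to the earliest window, which is another arbitrary choice of this sketch.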

Stage 3: passage extraction and ranking - extraction
  - The criterion for the optimal passage in a paragraph is given by:
    “Given various windows of different sizes, choose the one which has the maximum number of query concepts and the smallest size.”

Stage 3: passage extraction and ranking - ranking
  - The ranking of passages is similar to the ranking of paragraphs. For each passage, we computed its concept similarity and word similarity with the query. Then the concept retrieval model is applied for the ranking.

Experiment results
  - 3 runs:
      UICgen1: the top 1,000 most relevant paragraphs were returned as the passages.
      UICgen2: the top 1,000 optimal passages according to the criterion were returned (some bugs).
      UICgen3: same as UICgen2, except the bugs were removed.

Experiment results

  Run       Document MAP   # best   # > Median
  UICgen1   0.5439         3        25
  UICgen2   0.5268         2        25
  UICgen3   0.5320         3        25

  Run       Passage MAP    # best   # > Median
  UICgen1   0.0750         0        25
  UICgen2   0.1243         0        25
  UICgen3   0.1479         7        25

  Run       Aspect MAP     # best   # > Median
  UICgen1   0.4411         7        25
  UICgen2   0.3478         1        23
  UICgen3   0.3492         1        24

Reference
  - Buttcher S, Clarke CLA, Cormack GV. Domain-specific synonym expansion and validation for biomedical information retrieval (MultiText experiments for TREC 2004). The Thirteenth Text REtrieval Conference (TREC 2004) Proceedings, 2004, Gaithersburg, MD.
  - Huang X, Zhong M, Si L. York University at TREC 2005: Genomics Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.
  - Zhou W, Torvik VI, Smalheiser NR. ADAM: Another Database of Abbreviations in MEDLINE. Bioinformatics 2006; 22(22): 2813-2818.
  - Liu S, Liu F, Yu C, Meng WY. An Effective Approach to Document Retrieval via Utilizing WordNet and Recognizing Phrases. Proceedings of the 27th Annual International ACM SIGIR Conference, pp. 266-272, Sheffield, UK, July 2004.
  - Liu S, Yu C. UIC at TREC2005: Robust Track. The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005, Gaithersburg, MD.

Questions Thanks!
