An Effective Approach to
Document Retrieval via Utilizing
WordNet and Recognizing Phrases
Shuang Liu, Fang Liu, Clement Yu
University of Illinois at Chicago
Weiyi Meng
Binghamton University
Abstract





Noun phrases in queries are identified and classified into
four types.
A document has a phrase if all content words in the
phrase are within a window of a certain size.
The window sizes for different types of phrases are
different and are determined using a decision tree.
Documents in response to a query are ranked with
matching phrases given a higher priority.
We utilize WordNet to disambiguate word senses of
query terms and expand query.
Introduction



Techniques involving phrases, general dictionaries (e.g., WordNet),
and word sense disambiguation (WSD) have been investigated in the past, with mixed results.
Noun phrases are classified into four types:
 Proper names
 Dictionary phrases
 Simple phrases, which do not have any embedded phrases
 Complex phrases, which have embedded phrases or a noun involved in a more complicated way with other words.
A document has a phrase if all the content words in the phrase are
within a window of a certain size, which depends on the type of the
phrase.
(e.g., the window size for a simple phrase is smaller than that for a
complex phrase.)
Introduction

A phrase is significant, and is used in retrieval, if
1. it is a proper name or a dictionary phrase, or
2. it is a simple or complex phrase whose content words are
   highly positively correlated.
The similarity measure between a query and a document has
two components (phrase-sim, term-sim).
Documents are ranked in descending order of (phrase-sim,
term-sim).
Introduction



For adjacent query words, the following information from
WordNet is utilized to do WSD: synonym sets, hyponym sets,
and their definitions.
When the sense of a query word is determined, its synonyms,
hyponyms, words or phrases from its definition, and its
compound words are considered for possible addition to the
query.
We also apply pseudo-relevance feedback by considering terms that are
positively globally correlated with a query term/phrase.
Phrase Identification




Noun phrases in a query are classified into one of the four
types mentioned.
Brill’s tagger is used to assign POS tags to words in a query.
A document has a phrase if the content words in the phrase
appear within a window of a certain size.
The window sizes are learned from some training data.
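The window test described above can be sketched as follows (a minimal illustration, not the paper's implementation; `has_phrase`, `WINDOW_SIZE`, and the gap-counting convention are assumptions, with the sizes taken from the learned values reported on the later slide):

```python
# A document "has" a phrase if every content word of the phrase occurs,
# with some combination of occurrences separated by at most the allowed
# number of intervening words for that phrase type.
from itertools import product

# Window sizes as learned from training (see "Determining Window Sizes").
WINDOW_SIZE = {
    "proper name": 0,        # content words must be adjacent
    "dictionary": 16,
    "simple": 48,
    "complex": 78,
}

def has_phrase(doc_tokens, content_words, phrase_type):
    """True if one occurrence of each content word fits in the window."""
    positions = {w: [i for i, t in enumerate(doc_tokens) if t == w]
                 for w in content_words}
    if any(not occ for occ in positions.values()):
        return False         # some content word never appears
    for combo in product(*positions.values()):
        # Number of intervening words spanned by this combination
        # (an assumed interpretation of "window size").
        gap = max(combo) - min(combo) - (len(content_words) - 1)
        if gap <= WINDOW_SIZE[phrase_type]:
            return True
    return False

doc = ("was hospitalized for mental problems and had been "
       "on lithium for his illness").split()
print(has_phrase(doc, ["mental", "illness"], "dictionary"))   # True
print(has_phrase(doc, ["mental", "illness"], "proper name"))  # False
```

With the dictionary-phrase window, the "mental illness" fragment from the earlier example matches even though the two words are eight words apart.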
Determining Phrases Types

Proper names include names of people, places and organizations,
and are recognized by the named entity recognizer Minipar [Lin94].

Dictionary phrases are phrases that can be found in dictionaries
such as WordNet.

A simple phrase has no embedded noun phrase and has two to four
words, at most two of which are content words.
Ex: “school uniform”

A complex phrase either has one or more dictionary phrases
and/or simple phrases embedded in it, or there is a noun involved in
some complicated way with other words in the phrase. Ex:
“instruments of forecast weather”
Determining Window Sizes


For each of the four types of phrases, the required window size
varies.
Intuitively, the content words of a dictionary phrase should be close
together (ex: within 3 words apart), but this might not be true in
practice.
Ex: the query "drugs for mental illness" contains the dictionary phrase "mental illness",
yet a relevant document in a TREC collection contains the fragment
"... was hospitalized for mental problems... and had been on lithium for his
illness until recently"

Suitable distances between content words of different types of
phrases should be learned.
Determining Window Sizes
1. For a set of training queries, identify the types of phrases and the
   distances between the content words of the phrases in all of the
   relevant documents and in the irrelevant documents having high
   similarities with the queries.
2. Feed this information into a decision tree (C4.5), which produces a
   proper distance for each type of phrase.
3. Learning results:
   proper name: 0        dictionary phrase: 16
   simple phrase: 48     complex phrase: 78
4. These window sizes are obtained when TREC9 queries are trained on
   the TREC WT10G data collection.
Determining Significant Phrases


Proper names and dictionary phrases are assumed to be significant.
A simple phrase or a complex phrase is significant if the content
words within the phrase are highly positively correlated.
correlation(t1, t2, ..., tn) = ( P(phrase) − ∏_{ti ∈ phrase} P(ti) ) / ∏_{ti ∈ phrase} P(ti)

If correlation(t1, t2, ..., tn) > 5, the phrase is significant.
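A small sketch of the significance test, assuming the probabilities are estimated from document frequencies (the estimation method and the toy counts are assumptions; the slide only gives the formula and the threshold):

```python
from math import prod

def correlation(p_phrase, p_terms):
    """(P(phrase) - prod(P(ti))) / prod(P(ti)): how much more often the
    content words co-occur as a phrase than independence would predict."""
    expected = prod(p_terms)
    return (p_phrase - expected) / expected

# Hypothetical frequencies over N = 10000 documents:
N = 10000
p_school, p_uniform = 300 / N, 200 / N
p_phrase = 50 / N                 # documents containing "school uniform"
c = correlation(p_phrase, [p_school, p_uniform])
print(c > 5)   # True: "school uniform" would be judged significant
```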
Phrase Similarity Computation

The similarity of a document with a query has two components
(phrase-sim, term-sim)

The phrase-sim is more important than the term-sim.

Each query term in the document contributes to term-sim.

The term-sim is computed by Okapi formula.


The phrase-sim of a document is the sum of the idfs of the distinct significant
phrases occurring in the document.
For a complex phrase, the phrase-sim is the idf of the complex phrase
plus the idfs of the embedded significant phrases.
Word Sense Disambiguation
Using WordNet

Suppose two adjacent terms t1 and t2 in a query form a phrase; the
sense of t1 can be determined by executing the following steps:
1. If t2 or a synonym of t2 is found in the definition of a synset S of t1,
   S is determined to be the sense of t1.
2. The definition of each synset of t1 is compared against the definition
   of each synset of t2, and the combination that has the most content
   words in common yields the sense for t1 and the sense for t2.
3. If t2 or one of its synonyms appears in the definition of a synset S
   containing a hyponym of t1, then the sense of t1 is the synset S1 which
   has the descendant S.
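Steps 1-2 can be illustrated with a self-contained sketch; the toy synsets, the stopword list, and the overlap scoring are assumptions standing in for WordNet data (in practice these would come from a WordNet API):

```python
# Toy stand-ins for WordNet synsets of two query terms.
SYNSETS = {
    "bank": [
        {"sense": "bank.n.01", "synonyms": {"bank"},
         "definition": "sloping land beside a body of water such as a river"},
        {"sense": "bank.n.02", "synonyms": {"bank", "depository"},
         "definition": "a financial institution that accepts deposits"},
    ],
    "river": [
        {"sense": "river.n.01", "synonyms": {"river"},
         "definition": "a large natural stream of water"},
    ],
}

STOPWORDS = {"a", "an", "the", "of", "that", "such", "as", "beside", "by"}

def content_words(text):
    return {w for w in text.lower().split() if w not in STOPWORDS}

def disambiguate(t1, t2):
    # Step 1: t2 (or a synonym of t2) appears in a definition of t1.
    t2_forms = set().union(*(s["synonyms"] for s in SYNSETS[t2]))
    for s in SYNSETS[t1]:
        if t2_forms & content_words(s["definition"]):
            return s["sense"]
    # Step 2: pick the pair of definitions sharing the most content words.
    best, best_overlap = None, 0
    for s1 in SYNSETS[t1]:
        for s2 in SYNSETS[t2]:
            overlap = len(content_words(s1["definition"]) &
                          content_words(s2["definition"]))
            if overlap > best_overlap:
                best, best_overlap = s1["sense"], overlap
    return best

print(disambiguate("bank", "river"))  # 'bank.n.01' via step 1
print(disambiguate("river", "bank"))  # 'river.n.01' via step 2 ("water")
```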
Word Sense Disambiguation
Using WordNet (cont.)
4. If a synset S1 of t1 has a hyponym synset U that contains a term h,
   which appears in the definition of a synset S2 of t2, then the senses
   of t1 and t2 are S1 and S2, respectively.
5. If all the preceding steps fail, consider the surrounding terms in
   the query.
6. If t1 has a dominant synset S, then the sense of t1 is S.
Query Expansion Using
WordNet

Once the sense of a query term t1 is determined to be the synset S, the
following four cases are considered for query expansion (QE):
1. Add Synonyms:
   For any term t' in S, t' is added to the query if either
   (a) S is a dominant synset of t' or (b) t' is highly correlated with t2.
   The weight of t' is W(t') = f(t', S) / F(t')
2. Add Definition Words:
   If t1 is a single-sense word, the first shortest noun phrase from the
   definition will be added if it is highly globally correlated with t1.
3. Add Hyponyms:
   The conditions are similar to "Add Synonyms".
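The "Add Synonyms" rule can be sketched as below; interpreting the dominant synset as the most frequent sense of t', along with the toy sense counts and the correlation threshold, are assumptions (the slide does not specify them):

```python
def synonym_weight(tprime_sense_freqs, sense, corr_with_t2,
                   corr_threshold=5.0):
    """Return W(t') = f(t', S) / F(t') if t' qualifies, else None.
    Condition (a): S is t''s dominant (here: most frequent) synset;
    condition (b): t' is highly correlated with the other phrase term."""
    total = sum(tprime_sense_freqs.values())          # F(t')
    dominant = max(tprime_sense_freqs, key=tprime_sense_freqs.get)
    if dominant == sense or corr_with_t2 > corr_threshold:
        return tprime_sense_freqs[sense] / total      # f(t', S) / F(t')
    return None

# Hypothetical counts: t' occurs 30 times with sense S1, 10 with S2.
print(synonym_weight({"S1": 30, "S2": 10}, "S1", corr_with_t2=0.0))  # 0.75
print(synonym_weight({"S1": 30, "S2": 10}, "S2", corr_with_t2=0.0))  # None
```

The weight thus down-ranks synonyms that usually carry a different sense than the one matched in the query.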
Query Expansion Using
WordNet (cont.)
4. Add Compound Words:
   Suppose c is a compound word of a query term t1 and c has a dominant
   synset V. The word c can be added to the query if:
   (a) the definition of V contains t1 as well as all terms that form a
       phrase with t1 in the query, or
   (b) the definition of V contains the term t1, and c relates to t1
       through a "member of" relation.
Pseudo Relevance Feedback

Two methods are used

Using global correlations and WordNet
global_correlation(ti, s) = idf(s) × log(correlation(ti, s))
where s is a query concept and ti is a concept in the documents.
A term among the top 10 most highly globally correlated terms is added to
the query if (1) it has a single sense and (2) its definition contains
some other query terms.
Combining local and global correlations

A term is brought in if it correlates highly with the query in the top
ranked documents and globally with a query concept in the
collection.
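The first feedback method above can be sketched as follows; the candidate terms, their correlation values, sense counts, and the idf value are all hypothetical:

```python
from math import log

def global_correlation(idf_s, corr):
    # global_correlation(ti, s) = idf(s) * log(correlation(ti, s))
    return idf_s * log(corr)

# Hypothetical candidates: (term, correlation with query concept s,
# number of senses, whether its definition mentions another query term).
candidates = [
    ("lithium", 14.0, 1, True),
    ("therapy",  9.0, 3, True),
    ("clinic",  11.0, 1, False),
]
idf_s = 4.2
scored = sorted(candidates,
                key=lambda c: global_correlation(idf_s, c[1]),
                reverse=True)
# Keep the top 10 by score, then require a single sense and a definition
# mentioning some other query term (conditions (1) and (2) above).
added = [t for (t, corr, senses, mentions) in scored[:10]
         if senses == 1 and mentions]
print(added)  # ['lithium']
```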
Modification of the Query and
the Similarity Function

Query expansion will result in a final Boolean query.
ex: original query phrase = (t1 AND t2); t1 brings in t1' and t2 brings in t2'
expanded = (t1 AND t2) or (t1’ AND t2) or (t1’ AND t2’) or (t1 AND t2’)


When computing the phrase-sim of a document, any occurrence of an
expanded phrase is treated as an occurrence of the original phrase.
The term-sim is computed based on the occurrences of t1, t2, t1’, t2’.
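The expansion on this slide amounts to taking every AND-combination of each original term's alternatives; a minimal sketch (`expand_phrase` and the string representation are illustrative, not the paper's implementation):

```python
from itertools import product

def expand_phrase(alternatives):
    """alternatives: one list of alternatives per original term,
    e.g. [[t1, t1'], [t2, t2']]. Returns the OR-ed AND-combinations."""
    return [" AND ".join(combo) for combo in product(*alternatives)]

print(expand_phrase([["t1", "t1'"], ["t2", "t2'"]]))
# ["t1 AND t2", "t1 AND t2'", "t1' AND t2", "t1' AND t2'"]
```

With two alternatives per term, a two-term phrase yields the four conjunctions shown on the slide.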
Experiment Setup

100 queries from the TREC9 and TREC10 ad hoc query sets
and 100 queries from the TREC12 robust query set are used.

Only the title portion is used.

Algorithms used:


SO: the Standard Okapi formula for passage retrieval
NO: the Okapi formula is modified by replacing pl and avgpl by Norm
and AvgNorm respectively.

NO+P: phrase-sim is computed

NO+P+D: WSD techniques are used to expand the query

NO+P+D+F: pseudo-feedback techniques are employed
Experiment Results

The best-known results:
  0.2078 in TREC9 [OMNH00]
  0.2225 in TREC10 [AR02]
  In TREC12:
    0.2900 using Web data and the descriptions [Kwok03]
    0.2692 using Web data and title [Yeung03]
    0.2052 without Web data [ZhaiTao03]
Conclusions


We provide an effective approach to process typical short
queries and demonstrate that it yields significant
improvements over existing algorithms in three TREC
collections under the same experimental conditions.
We plan to make use of Web data to achieve further
improvement.