CS 430 / INFO 430
Information Retrieval
Lecture 8
Query Refinement and Relevance Feedback
Course Administration
Assignment Reports
A sample report will be posted before the next assignment is due.

Preparation for Discussion Classes

Most of the readings were used last year. You can see the questions that were used on last year's Web site:
http://www.cs.cornell.edu/Courses/cs430/2004fa/
CS 430 / INFO 430
Information Retrieval
Completion of Lecture 7
Search for Substring
In some information retrieval applications, any substring can be a search term.

Tries, using suffix trees, provide lexicographical indexes for all the substrings in a document or set of documents.
Tries: Search for Substring
Basic concept
The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique.

The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once.

Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node.

Suffix trees have a size of the same order of magnitude as the input documents.
Tries: Suffix Tree
Example: suffix tree for the following words: begin, beginning, between, bread, break

b
├─ e
│  ├─ gin
│  │  ├─ (null)   begin
│  │  └─ ning     beginning
│  └─ tween       between
└─ rea
   ├─ d           bread
   └─ k           break
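A real suffix tree takes some code to build; as a minimal sketch of the same idea (Python, with illustrative names), the snippet below indexes every suffix of every word in a sorted list, a suffix-array-style stand-in for the suffix tree, and answers substring queries by binary search over the sorted suffixes.

```python
# A minimal sketch (suffix-array style, standing in for a real suffix tree):
# index every suffix ("sistring") of every word, sorted lexicographically.
import bisect

words = ["begin", "beginning", "between", "bread", "break"]

# Pair each suffix with the word it came from, then sort by suffix.
suffixes = sorted((w[i:], w) for w in words for i in range(len(w)))
keys = [s for s, _ in suffixes]   # sorted suffixes only, for binary search

def words_containing(substring):
    """All words containing `substring`: every occurrence is a prefix of
    some suffix, and those suffixes are contiguous in sorted order."""
    found = set()
    for suffix, word in suffixes[bisect.bisect_left(keys, substring):]:
        if not suffix.startswith(substring):
            break
        found.add(word)
    return found

print(words_containing("rea"))   # {'bread', 'break'}
print(words_containing("gin"))   # {'begin', 'beginning'}
```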
Tries: Sistrings
A binary example

String: 01 100 100 010 111

Sistrings (numbered by starting position):
1: 01 100 100 010 111
2: 11 001 000 101 11
3: 10 010 001 011 1
4: 00 100 010 111
5: 01 000 101 11
6: 10 001 011 1
7: 00 010 111
8: 00 101 11
Tries: Lexical Ordering
7: 00 010 111
4: 00 100 010 111
8: 00 101 11
5: 01 000 101 11
1: 01 100 100 010 111
6: 10 001 011 1
3: 10 010 001 011 1
2: 11 001 000 101 11

The unique prefix of each sistring was indicated in blue on the original slide.
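A few lines of Python (illustrative, reproducing the two slides above) generate the eight sistrings of the example bit string and sort them lexicographically:

```python
# Generate the first eight sistrings of the example string and sort them.
bits = "01100100010111"   # the example string, spaces removed

# Each entry pairs a starting position (1-indexed) with its sistring.
sistrings = [(i + 1, bits[i:]) for i in range(8)]

for pos, s in sorted(sistrings, key=lambda t: t[1]):
    print(pos, s)
# Prints the sistrings in the order 7, 4, 8, 5, 1, 6, 3, 2, as above.
```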
Trie: Basic Concept
[Figure: a binary trie built from the eight sistrings. Starting at the root, each node branches on the next bit (0 or 1); each sistring is stored at the leaf reached by following its bits until its prefix is unique, so the leaves are labeled with the sistring numbers 1-8.]
Patricia Tree
[Figure: the Patricia tree for the same sistrings. Each internal node is labeled with the number of the bit on which it branches, so chains of single-descendant nodes in the trie are collapsed.]

Single-descendant nodes are eliminated.
Nodes have bit number.
CS 430 / INFO 430
Information Retrieval
Lecture 8
Query Refinement and Relevance Feedback
Query Refinement
[Flowchart: the user formulates a new query → Search → Display retrieved information → the user either EXITs or reformulates the query, and the reformulated query is fed back into Search.]
Reformulation of Query
Manual
• Add or remove search terms
• Change Boolean operators
• Change wild cards

Automatic
• Remove search terms
• Change weighting of search terms
• Add new search terms
Manual Reformulation:
Vocabulary Tools
Feedback
• Information about stop lists, stemming, etc.
• Numbers of hits on each term or phrase

Suggestions
• Thesaurus
• Browse lists of terms in the inverted index
• Controlled vocabulary
Manual Reformulation:
Document Tools
Feedback to user consists of document excerpts or surrogates
• Shows the user how the system has interpreted the query

Effective at suggesting how to restrict a search
• Shows examples of false hits

Less good at suggesting how to expand a search
• No examples of missed items
Relevance Feedback: Document
Vectors as Points on a Surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
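As a small sketch of this geometry (Python with numpy, illustrative vectors): after unit normalization, cosine similarity reduces to a dot product, and similar documents end up close together on the unit sphere.

```python
import numpy as np

# Three toy document vectors; the first two point in nearly the same direction.
docs = np.array([[2.0, 1.0, 0.0],
                 [4.0, 2.1, 0.0],
                 [0.0, 1.0, 3.0]])

# Normalize every vector to length 1, so all of them end on the unit sphere.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# For unit vectors, dot product = cosine similarity; entries near 1 mean
# the corresponding points are close together on the sphere.
print(unit_docs @ unit_docs.T)
```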
Results of a Search
[Figure: documents found by the search (x) clustered around the query; the hits from the search lie in a region surrounding the query point.]
Relevance Feedback (Concept)
[Figure: hits from the original search plotted around the original query. The user marks documents as relevant (o) or non-relevant (x); the reformulated query moves from the original query toward the relevant documents.]
Theoretically Best Query
[Figure: the optimal query lies among the relevant documents (o) and away from the non-relevant documents (x) in the document space.]
Theoretically Best Query
For a specific query, q, let:

DR be the set of all relevant documents
DN-R be the set of all non-relevant documents
sim(q, DR) be the mean similarity between query q and documents in DR
sim(q, DN-R) be the mean similarity between query q and documents in DN-R

The theoretically best query would maximize:

F = sim(q, DR) - sim(q, DN-R)
Estimating the Best Query
In practice, DR and DN-R are not known. (The objective is to find them.)

However, the results of an initial query can be used to estimate sim(q, DR) and sim(q, DN-R).
Rocchio's Modified Query
Modified query vector
= Original query vector
+ Mean of relevant documents found by original query
- Mean of non-relevant documents found by original query
Query Modification
q1 = q0 + (1/n1) Σi=1..n1 ri - (1/n2) Σi=1..n2 si
q0 = vector for the initial query
q1 = vector for the modified query
ri = vector for relevant document i
si = vector for non-relevant document i
n1 = number of relevant documents
n2 = number of non-relevant documents
Rocchio 1971
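A minimal sketch of this formula in Python (numpy; the vectors and function name are illustrative):

```python
import numpy as np

def rocchio(q0, relevant, non_relevant):
    """q1 = q0 + mean(relevant docs) - mean(non-relevant docs)."""
    q1 = np.asarray(q0, dtype=float)
    if relevant:                        # guard against empty feedback sets
        q1 = q1 + np.mean(relevant, axis=0)
    if non_relevant:
        q1 = q1 - np.mean(non_relevant, axis=0)
    return q1

q0 = np.array([1.0, 0.0, 0.5])
r = [np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.0, 0.6])]   # judged relevant
s = [np.array([0.1, 0.9, 0.0])]                              # judged non-relevant
print(rocchio(q0, r, s))   # moves toward r and away from s
```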
Difficulties with Relevance Feedback
[Figure: the hits from the initial query are contained in a gray shaded area covering only part of the document space. The reformulated query moves from the original query toward the relevant documents (o) found in that area, but the optimal query lies near relevant documents outside the shaded area, which the initial search never retrieved. Legend: x non-relevant documents, o relevant documents.]
Difficulties with Relevance Feedback
[Figure: the same document space as the previous slide, asking what region provides the optimal results set; relevant (o) and non-relevant (x) documents are interspersed around the original and reformulated queries.]
Effectiveness of Relevance Feedback
Best when:
• Relevant documents are tightly clustered (similarities are large)
• Similarities between relevant and non-relevant documents are small
When to Use Relevance Feedback
Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents.

Under these circumstances, users can be expected to put effort into searching:
• Formulate queries thoughtfully with many terms
• Review results carefully to provide feedback
• Iterate several times
• Combine automatic query enhancement with studies of thesauruses and other manual enhancements
Adjusting Parameters 1:
Relevance Feedback
q1 = α q0 + (β/n1) Σi=1..n1 ri - (γ/n2) Σi=1..n2 si

α, β and γ are weights that adjust the importance of the three vectors.

If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set.

If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
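The weighted form is a small change to the sketch above; the default weight values here are illustrative, not part of the slide.

```python
import numpy as np

def rocchio_weighted(q0, relevant, non_relevant,
                     alpha=1.0, beta=0.75, gamma=0.15):   # illustrative defaults
    q1 = alpha * np.asarray(q0, dtype=float)
    if relevant and beta:
        q1 = q1 + (beta / len(relevant)) * np.sum(relevant, axis=0)
    if non_relevant and gamma:
        q1 = q1 - (gamma / len(non_relevant)) * np.sum(non_relevant, axis=0)
    return q1

# gamma = 0 gives positive feedback only; beta = 0 gives negative feedback only.
```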
Adjusting Parameters 2:
Filtering Incoming Messages
D1, D2, D3, ... is a stream of incoming documents that are to be divided into two sets:

R - documents judged relevant to an information need
S - documents judged not relevant to the information need

A query is defined as the vector in the term vector space:

q = (w1, w2, ..., wn)

where wi is the weight given to term i.

Dj will be assigned to R if similarity(q, Dj) > θ, for some threshold θ.

What is the optimal query, i.e., the optimal values of the wi?
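A minimal sketch of the filtering loop (Python; the threshold value and vectors are illustrative):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.5, 0.0])   # profile query in the term vector space
theta = 0.8                     # acceptance threshold

stream = [np.array([0.9, 0.6, 0.1]),   # similar to the profile
          np.array([0.0, 0.2, 1.0])]   # not similar

R, S = [], []
for d in stream:
    # Assign each incoming document to R or S by thresholded similarity.
    (R if cosine(q, d) > theta else S).append(d)

print(len(R), "assigned to R;", len(S), "assigned to S")
```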
Seeking Optimal Parameters
Theoretical approach
Develop a theoretical model
Derive parameters
Test with users
Heuristic approach
Develop a heuristic
Vary parameters
Test with users
Machine learning approach
Seeking Optimal Parameters using
Machine Learning
GENERAL:

Input:
• training examples
• design space

Training:
• automatically find the solution in design space that works well on the training data

Prediction:
• predict well on new examples

EXAMPLE: Text Retrieval

Input:
• queries with relevance judgments
• parameters of retrieval function

Training:
• find parameters so that many relevant documents are ranked highly

Prediction:
• rank relevant documents high also for new queries
Joachims
Machine Learning: Tasks and
Applications
Task: Text Routing
Application: Help-Desk Support: Who is an appropriate expert for a particular problem?

Task: Information Filtering
Application: Information Agents: Which news articles are interesting to a particular person?

Task: Relevance Feedback
Application: Information Retrieval: What are other documents relevant for a particular query?

Task: Text Categorization
Application: Knowledge Management: Organizing a document database by semantic categories.
Learning to Rank
Assume:
• distribution of queries P(q)
• distribution of target rankings for query P(r | q)
Given:
• collection D of documents
• independent, identically distributed training sample (qi, ri)
Design:
• set of ranking functions F
• loss function l(ra, rb)
• learning algorithm
Goal:
• find f ∈ F that minimizes l(f(q), r) integrated across all queries
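As a toy sketch of this setup (Python; all names are illustrative): given a small candidate set F and a training sample, pick the ranking function with the lowest total loss.

```python
def best_ranker(F, sample, loss):
    """Pick the f in F minimizing total loss over (query, target ranking)
    pairs; `loss` could be the discordant-pair count on the next slide."""
    return min(F, key=lambda f: sum(loss(f(q), r) for q, r in sample))
```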
A Loss Function for Rankings
For two orderings ra and rb, a pair is:
• concordant, if ra and rb agree in their ordering
P = number of concordant pairs
• discordant, if ra and rb disagree in their ordering
Q = number of discordant pairs
Loss function: l(ra, rb) = Q

Example:
ra = (a, c, d, b, e, f, g, h)
rb = (a, b, c, d, e, f, g, h)

The discordant pairs are: (c, b), (d, b)
l(ra, rb) = 2
Joachims
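A short Python sketch of this loss, counting the discordant pairs:

```python
from itertools import combinations

def ranking_loss(ra, rb):
    """Number of pairs ordered one way in ra and the other way in rb (Q)."""
    pos = {item: i for i, item in enumerate(rb)}   # position of each item in rb
    # combinations(ra, 2) yields (x, y) with x before y in ra;
    # the pair is discordant when y comes before x in rb.
    return sum(1 for x, y in combinations(ra, 2) if pos[x] > pos[y])

ra = list("acdbefgh")
rb = list("abcdefgh")
print(ranking_loss(ra, rb))   # 2: the discordant pairs (c, b) and (d, b)
```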
Machine Learning: Algorithms
The choice of algorithms is a subject of active research, which is covered in several courses, notably CS 478 and CS/INFO 630.

Some effective methods include:
• Naive Bayes
• Rocchio Algorithm
• C4.5 Decision Tree
• k-Nearest Neighbors
• Support Vector Machine
Relevance Feedback:
Clickthrough Data
Relevance feedback methods have suffered from the unwillingness of users to provide feedback.

Joachims and others have developed methods that use clickthrough data from online searches.

Concept: Suppose that a query delivers a set of hits to a user. If the user skips a link a and clicks on a link b ranked lower, then the user's preference reflects rank(b) < rank(a).
Clickthrough Example
Ranking Presented to User:

1. Kernel Machines
   http://svm.first.gmd.de/
2. Support Vector Machine
   http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines
   http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References
   http://svm.research.bell-labs.com/SVMrefs.html

User clicks on 1, 3 and 4.

Inferred ranking preferences: (3 < 2) and (4 < 2)
Joachims
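A sketch of the extraction rule (Python; the function name is illustrative): each clicked result is preferred over every higher-ranked result the user skipped.

```python
def preference_pairs(ranking, clicked):
    """ranking: result ids in presented order; clicked: set of clicked ids."""
    pairs = []
    for i, b in enumerate(ranking):
        if b in clicked:
            # b was clicked: prefer it over every skipped link ranked above it.
            pairs.extend((b, a) for a in ranking[:i] if a not in clicked)
    return pairs

# The example above: five results shown, user clicks 1, 3 and 4.
print(preference_pairs([1, 2, 3, 4, 5], {1, 3, 4}))
# [(3, 2), (4, 2)]  i.e. rank(3) < rank(2) and rank(4) < rank(2)
```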