CS 430: Information Discovery
Lecture 17
Probabilistic Information Retrieval
Course Administration
Midterm Examination
Kimball B11, 7:30 to 9:00 pm on Wednesday, October 31.
Assignment 3
Revised version now online:
• Clarifies requirements, e.g., precedence of operators, stemming with wild cards, etc.
• Detailed submission requirements, so that we can better grade and comment on your work.
Three Approaches to Information Retrieval
Many authors divide the methods of information retrieval into
three categories:
Boolean (based on set theory)
Vector space (based on linear algebra)
Probabilistic (based on Bayesian statistics)
In practice, the latter two have considerable overlap.
Probability Ranking Principle
"If a reference retrieval system’s response to each request is a
ranking of the documents in the collections in order of
decreasing probability of usefulness to the user who submitted
the request, where the probabilities are estimated as accurately
a possible on the basis of whatever data made available to the
system for this purpose, then the overall effectiveness of the
system to its users will be the best that is obtainable on the
basis of that data."
W.S. Cooper
Probabilistic Ranking
Basic concept:
"For a given query, if we know some documents that are
relevant, terms that occur in those documents should be given
greater weighting in searching for other relevant documents.
By making assumptions about the distribution of terms and
applying Bayes Theorem, it is possible to derive weights
theoretically."
Van Rijsbergen
Probability Theory -- Bayesian Formulas
Notation
Let a, b be two events.
P(a | b) is the probability of a given b
Bayes Theorem
P(a | b) = P(b | a) P(a) / P(b)

P(ā | b) = P(b | ā) P(ā) / P(b)     where ā is the event not a

Derivation
P(a | b) P(b) = P(a ∩ b) = P(b | a) P(a)
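To see the formulas in action, here is a quick numerical check in Python; the probabilities chosen below are arbitrary examples:

```python
# Minimal numeric check of Bayes Theorem; P(a) and the two
# conditionals are arbitrary example values.
p_a = 0.3                      # P(a)
p_not_a = 1 - p_a              # P(not-a)
p_b_given_a = 0.8              # P(b | a)
p_b_given_not_a = 0.1          # P(b | not-a)

# Total probability: P(b) = P(b | a) P(a) + P(b | not-a) P(not-a)
p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
p_a_given_b = p_b_given_a * p_a / p_b
p_not_a_given_b = p_b_given_not_a * p_not_a / p_b

print(p_a_given_b)                    # 0.774..., i.e., 0.24 / 0.31
print(p_a_given_b + p_not_a_given_b)  # 1.0, since a and not-a are exhaustive
```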
Concept
R is a set of documents that are guessed to be relevant and R̄ is
the complement of R.
1. Guess a preliminary probabilistic description of R and
use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of approximations
to R.
Probabilistic Principle
Given a user query q and a document dj, the model estimates the
probability that the user finds dj relevant, i.e., P(R | dj).

similarity (dj, q) = P(R | dj) / P(R̄ | dj)

                   = [ P(dj | R) P(R) ] / [ P(dj | R̄) P(R̄) ]    by Bayes Theorem

                   = P(dj | R) / P(dj | R̄) x constant
Binary Independence Retrieval Model (BIR)

Suppose that the weights for term i in document dj and query
q are wi,j and wi,q, where all weights are 0 or 1.

Let P(ki | R) be the probability that index term ki is present in
a document randomly selected from the set R, and P(ki | R̄) the
corresponding probability for the set R̄.

If the index terms are independent, after some mathematical
manipulation, taking logs and ignoring factors that are
constant for all documents:

similarity (dj, q) = Σi wi,q x wi,j x ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | R̄)) / P(ki | R̄) ] )
Estimates of P(ki | R)
Initial guess, with no information to work from:
P(ki | R) = c
P(ki | R̄) = ni / N
where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain ki
N is the total number of documents in the collection
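Here is a minimal sketch of a first-pass BIR ranking under these initial estimates; the collection, query, and constant are toy values invented for illustration, with binary weights wi,q = wi,j = 1:

```python
import math

# First-pass BIR ranking using the initial estimates
# P(ki | R) = c and P(ki | R-bar) = ni / N.
docs = {
    "d1": {"probabilistic", "retrieval", "model"},
    "d2": {"vector", "space", "model"},
    "d3": {"probabilistic", "ranking"},
    "d4": {"boolean", "set", "theory"},
    "d5": {"bayes", "theorem"},
}
query = {"probabilistic", "model"}
N = len(docs)
c = 0.5  # arbitrary constant for the initial guess

# ni: number of documents that contain term ki
n = {}
for terms in docs.values():
    for term in terms:
        n[term] = n.get(term, 0) + 1

def term_weight(p_r, p_nr):
    # log[p/(1-p)] + log[(1-q)/q], the per-term weight in the BIR formula
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

def similarity(doc_terms):
    # Binary weights: only query terms present in the document contribute
    return sum(term_weight(c, n[t] / N) for t in query if t in doc_terms)

for d in sorted(docs, key=lambda d: similarity(docs[d]), reverse=True):
    print(d, round(similarity(docs[d]), 3))  # d1 ranks first with 0.811
```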
Improving the Estimates of P(ki | R)
Human feedback -- relevance feedback
Automatically
(a) Run query q using initial values. Consider the t top ranked
documents. Let r be the number of these documents that
contain the term ki.
(b) The new estimates are:
P(ki | R) = r / t
P(ki | R̄) = (ni - r) / (N - t)
Note: The ratio of these two terms, with minor changes of
notation and taking logs, gives w2 on page 368 of Frakes.
Continuation

similarity (dj, q)

  = Σi wi,q x wi,j x ( log [ P(ki | R) / (1 - P(ki | R)) ] + log [ (1 - P(ki | R̄)) / P(ki | R̄) ] )

  = Σi wi,q x wi,j x ( log r/(t - r) + log (N - t - ni + r)/(ni - r) )

  = Σi wi,q x wi,j x log { r/(t - r) } / { (ni - r)/(N - t - ni + r) }

Note: With a minor change of notation, this is w4 on page 368 of
Frakes.
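A minimal sketch of the re-estimation step and the resulting per-term weight, with invented counts; it also checks numerically that the summed-logs form and the single-log (w4) form agree:

```python
import math

# Automatic re-estimation (steps (a) and (b) above) for one term ki,
# followed by the combined weight. All counts are invented examples.
N = 1000    # documents in the collection
t = 20      # top-ranked documents from the first run, treated as R
n_i = 50    # documents in the collection containing ki
r = 15      # of the top t documents, how many contain ki

p_r = r / t                  # new estimate of P(ki | R)
p_nr = (n_i - r) / (N - t)   # new estimate of P(ki | R-bar)

# Per-term weight from the similarity formula ...
w = math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

# ... which collapses to the single log above (the w4 form):
w_direct = math.log((r / (t - r)) / ((n_i - r) / (N - t - n_i + r)))

print(round(w, 3), round(w_direct, 3))  # both print 4.394
```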
Probabilistic Weighting

w = log [ r/(R - r) ] / [ (n - r)/(N - R) ]

N   number of documents in collection
R   number of relevant documents for query q
n   number of documents with term t
r   number of relevant documents with term t

r/(R - r)        number of relevant documents with term t /
                 number of relevant documents without term t
(n - r)/(N - R)  number of non-relevant documents with term t /
                 number of non-relevant documents in collection
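A minimal sketch of this weight as a Python function; the counts in the example call are invented, and the function assumes 0 < r < R and r < n so that no ratio is zero or undefined:

```python
import math

def prob_weight(N, R, n, r):
    # w = log [ r/(R - r) ] / [ (n - r)/(N - R) ]
    relevant_ratio = r / (R - r)           # relevant with term / relevant without
    nonrelevant_ratio = (n - r) / (N - R)  # non-relevant with term / non-relevant total
    return math.log(relevant_ratio / nonrelevant_ratio)

# Invented example: 1000 documents, 30 relevant, term in 50 documents,
# 15 of which are relevant
print(round(prob_weight(N=1000, R=30, n=50, r=15), 3))  # 3.322
```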
Discussion of Probabilistic Model
Advantages
• Based on a firm theoretical foundation
Disadvantages
• Initial definition of R has to be guessed.
• Weights ignore term frequency
• Assumes independent index terms (as does vector model)
Review of Weighting
The objective is to measure the similarity between a document and a
query using statistical (not linguistic) methods.
Concept is to weight terms by some factor based on the distribution
of terms within and between documents.
In general:
(a) Weight is an increasing function of the number of times that the
term appears in the document
(b) Weight is a decreasing function of the number of documents that
contain the term (or the total number of occurrences of the term)
(c) Weight needs to be adjusted for documents that differ greatly in
length.
Normalization of Within Document Frequency (Term Frequency)
Normalization to moderate the effect of high-frequency terms
Croft's normalization:
cfij = K + (1 - K) fij/mi    (fij > 0)
fij is the frequency of term j in document i
cfij is Croft's normalized frequency
mi is the maximum frequency of any term in document i
K is a constant between 0 and 1 that is adjusted for the collection
K should be set to low values (e.g., 0.3) for collections with long
documents (35 or more terms).
K should be set to higher values (greater than 0.5) for collections
with short documents.
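A minimal sketch of Croft's normalization as a Python function; the calls below reproduce the weights in the example table on the next slide, taking the least frequent term to occur once (fij = 1):

```python
def croft_cf(f_ij, m_i, K):
    # Croft's normalized term frequency, defined for f_ij > 0
    return K + (1 - K) * f_ij / m_i

print(croft_cf(5, 5, 0.3))    # 1.00  most frequent term, mi = 5, K = 0.3
print(croft_cf(1, 5, 0.3))    # 0.44  term occurring once
print(croft_cf(1, 2, 0.3))    # 0.65
print(croft_cf(1, 25, 0.5))   # 0.52
print(croft_cf(1, 2, 0.5))    # 0.75
```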
Normalization of Within Document Frequency (Term Frequency)

Examples

Croft's normalization:  cfij = K + (1 - K) fij/mi   (fij > 0)

The least frequent term is assumed to occur once (fij = 1).

document length   K     mi   weight (most     weight (least
                             frequent term)   frequent term)
20                0.3    5   1.00             0.44
20                0.3    2   1.00             0.65
100               0.5   25   1.00             0.52
100               0.5    2   1.00             0.75
Measures of Within Document Frequency

(c) Salton and Buckley recommend using different weightings
for documents and queries:

documents
  fik for terms in collections of long documents
  1 for terms in collections of short documents

queries
  cfik with K = 0.5 for general use
  fik for long queries (cfik with K = 0)
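A minimal sketch of these recommendations as Python functions, reusing Croft's cf from the earlier slide; the flags and example values are invented for illustration:

```python
def croft_cf(f_ij, m_i, K):
    # Croft's normalized frequency from the earlier slide
    return K + (1 - K) * f_ij / m_i

def document_term_weight(f_ik, long_documents=True):
    # Documents: raw frequency fik for long documents, binary 1 for short ones
    return f_ik if long_documents else 1

def query_term_weight(f_ik, m_i, long_query=False):
    # Queries: cfik with K = 0.5 in general; for long queries, K = 0
    K = 0.0 if long_query else 0.5
    return croft_cf(f_ik, m_i, K)

print(document_term_weight(7))                        # 7 (long documents)
print(document_term_weight(7, long_documents=False))  # 1 (short documents)
print(query_term_weight(3, m_i=4))                    # 0.875 (K = 0.5)
print(query_term_weight(3, m_i=4, long_query=True))   # 0.75  (K = 0)
```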
Ranking -- Practical Experience

1. Basic method is inner (dot) product with no weighting
2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths
3. Term weighting using frequency of terms in document usually improves ranking
4. Term weighting using an inverse function of terms in the entire collection improves ranking (e.g., IDF)
5. Weightings for document structure improve ranking
6. Relevance weightings after initial retrieval improve ranking

Effectiveness of methods depends on characteristics of the
collection. In general, there are few improvements beyond
simple weighting schemes.
Inverse Document Frequency (IDF)

(a) Simplest to use is 1/dk   (Salton)

  dk   number of documents that contain term k

(b) Normalized forms (Sparck Jones):

  IDFi = log2 (N/ni) + 1
  or
  IDFi = log2 (maxn/ni) + 1

  N     number of documents in the collection
  ni    total number of occurrences of term i in the collection
  maxn  maximum frequency of any term in the collection
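As a minimal sketch, the IDF variants above as Python functions; all counts in the example calls are invented for illustration:

```python
import math

def idf_simple(d_k):
    # (a) Salton: 1 / dk
    return 1 / d_k

def idf_log(N, n_i):
    # (b) log2(N / ni) + 1
    return math.log2(N / n_i) + 1

def idf_maxn(maxn, n_i):
    # (b) variant: log2(maxn / ni) + 1
    return math.log2(maxn / n_i) + 1

N = 10_000       # documents in the collection
n_i = 50         # occurrences of term i
maxn = 120_000   # assumed count for the most frequent term

print(idf_simple(50))                 # 0.02
print(round(idf_log(N, n_i), 2))      # 8.64, i.e., log2(200) + 1
print(round(idf_maxn(maxn, n_i), 2))  # 12.23
```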