Search and Retrieval: More on
Term Weighting and Document Ranking
Prof. Marti Hearst
SIMS 202, Lecture 22
Today
Document Ranking
term weights
similarity measures
vector space model
probabilistic models
Multi-dimensional spaces
Clustering
Marti A. Hearst
SIMS 202, Fall 1997
Finding Out About
Three phases:
Asking of a question
Construction of an answer
Assessment of the answer
Part of an iterative process
Ranking Algorithms
Assign weights to the terms in the query.
Assign weights to the terms in the
documents.
Compare the weighted query terms to the
weighted document terms.
Rank order the results.
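The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes raw term counts as the weights and a plain inner product as the comparison, and the names `weigh`, `score`, `rank` and the sample documents are hypothetical, not from the lecture.

```python
from collections import Counter

def weigh(text):
    """Steps 1 and 2: assign a weight to each term (here, its raw count)."""
    return Counter(text.lower().split())

def score(query_weights, doc_weights):
    """Step 3: compare weighted query terms to weighted document terms."""
    # Counter returns 0 for terms the document does not contain.
    return sum(w * doc_weights[term] for term, w in query_weights.items())

def rank(query, docs):
    """Step 4: rank order the results, best match first (ties by name)."""
    q = weigh(query)
    scored = [(score(q, weigh(text)), name) for name, text in docs.items()]
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical three-document collection.
docs = {
    "d1": "nova galaxy heat nova",
    "d2": "film role film diet",
    "d3": "galaxy film",
}
ranking = rank("nova galaxy", docs)
print(ranking)
```

Swapping in a different weighting or comparison function changes the ranking without changing the overall structure.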
[Diagram: the retrieval process. Boxes: information need, text input, parse, query; collections, pre-process, index; rank.]
Vector Representation
(revisited; see Salton article in Science)
Documents and Queries are represented as
vectors.
Position 1 corresponds to term 1, position 2
to term 2, position t to term t
The weight of the term is stored in each
position
$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
$w = 0$ if a term is absent
Assigning Weights to Terms
Raw term frequency
tf x idf
Recall the Zipf distribution
Want to weight terms highly if they are
frequent in relevant documents … BUT
infrequent in the collection as a whole
Automatically derived thesaurus terms
Assigning Weights
tf x idf measure:
term frequency (tf)
inverse document frequency (idf)
Goal: assign a tf * idf weight to each
term in each document
tf x idf
$w_{ik} = tf_{ik} \cdot \log(N / n_k)$
$T_k$ = term $k$ in document $D_i$
$tf_{ik}$ = frequency of term $T_k$ in document $D_i$
$idf_k$ = inverse document frequency of term $T_k$ in $C$
$N$ = total number of documents in the collection $C$
$n_k$ = the number of documents in $C$ that contain $T_k$
$idf_k = \log(N / n_k)$
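As a rough sketch of how these definitions turn into weights, the following Python computes $w_{ik}$ for a tiny collection. The function name `tf_idf` and the sample documents are hypothetical, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, w_ik = tf_ik * log(N / n_k)."""
    N = len(docs)
    # n_k: the number of documents that contain term k (each doc counted once)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ik: frequency of each term in this document
        weights.append({t: tf[t] * math.log(N / doc_freq[t]) for t in tf})
    return weights

# Hypothetical collection of three tokenized documents.
docs = [
    ["nova", "galaxy", "nova"],
    ["galaxy", "film"],
    ["film", "role", "galaxy"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Note that "galaxy" occurs in every document, so its idf is log(1) = 0 and it contributes nothing to any document's weights, while "nova" is frequent in one document and rare in the collection, so it gets the highest weight.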
tf x idf normalization
Normalize the term weights (so longer
documents are not unfairly given more
weight)
normalize usually means force all values to fall within a
certain range, usually between 0 and 1, inclusive.
$w_{ik} = \frac{tf_{ik} \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 [\log(N / n_k)]^2}}$
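The effect of this normalization can be illustrated with a small sketch: each weight is divided by the Euclidean length of the document's weight vector. The function name `normalize` and the two sample weight vectors are hypothetical; the inputs stand in for tf x idf weights that have already been computed.

```python
import math

def normalize(weights):
    """Divide each term weight by the Euclidean length of the weight vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0:
        return dict(weights)  # all-zero vector: nothing to scale
    return {t: w / length for t, w in weights.items()}

# Hypothetical weights: the second document has the same term proportions
# as the first but is ten times longer.
short_doc = {"nova": 2.0, "film": 1.0}
long_doc = {"nova": 20.0, "film": 10.0}

norm_short = normalize(short_doc)
norm_long = normalize(long_doc)
print(norm_short)
print(norm_long)
```

After normalization the two documents have essentially the same weights, so the longer one no longer gets an advantage just from its length.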
Vector space similarity
(use the weights to compare the documents)
Now, the similarity of two documents is:
$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}$
This is also called the cosine, or normalized inner product.
(Normalization was done when weighting the terms.)
Vector Space Similarity Measure
combine tf x idf into a similarity measure
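$D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})$
$Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
$w = 0$ if a term is absent
if term weights normalized:
$sim(Q, D_i) = \sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}$
otherwise normalize in the similarity comparison:
$sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{qj} \, w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}$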
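The two cases on this slide give the same number, which a short sketch can confirm. The function names and the sample vectors below are hypothetical; vectors are plain lists with one position per term.

```python
import math

def inner_product(q, d):
    """Sum of products of corresponding weights."""
    return sum(qj * dj for qj, dj in zip(q, d))

def cosine(q, d):
    """Normalize inside the similarity comparison (the 'otherwise' case)."""
    denom = math.sqrt(inner_product(q, q) * inner_product(d, d))
    return inner_product(q, d) / denom if denom else 0.0

def unit(v):
    """Pre-normalize a weight vector to length 1."""
    length = math.sqrt(inner_product(v, v))
    return [x / length for x in v] if length else list(v)

# Hypothetical query and document weight vectors over three terms.
q = [1.0, 0.0, 2.0]
d = [3.0, 1.0, 0.0]

sim_in_comparison = cosine(q, d)
sim_prenormalized = inner_product(unit(q), unit(d))
print(sim_in_comparison, sim_prenormalized)
```

Pre-normalizing is usually the cheaper choice when the same document vectors are compared against many queries, since the square roots are computed once per document rather than once per comparison.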
To Think About
How does this ranking algorithm
behave?
Make a set of hypothetical documents
consisting of terms and their weights
Create some hypothetical queries
How are the documents ranked, depending
on the weights of their terms and the
queries’ terms?
Computing Similarity Scores
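$D_1 = (0.8, 0.3)$
$D_2 = (0.2, 0.7)$
$Q = (0.4, 0.8)$
$\cos \alpha_1 = 0.73$
$\cos \alpha_2 = 0.98$
[Figure: $Q$, $D_1$, and $D_2$ plotted as vectors in two dimensions, both axes running from 0 to 1.0; $\alpha_1$ is the angle between $Q$ and $D_1$, and $\alpha_2$ is the angle between $Q$ and $D_2$.]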
Computing a similarity score
Say we have query vector $Q = (0.4, 0.8)$
Also, document $D_2 = (0.2, 0.7)$
What does their similarity comparison yield?
$sim(Q, D_2) = \frac{(0.4 \cdot 0.2) + (0.8 \cdot 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.42}} = 0.98$
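The same arithmetic can be checked numerically. The sketch below reuses $Q$ and $D_2$ from this slide and $D_1 = (0.8, 0.3)$ from the previous one; the function name `cosine` and the variable names are illustrative.

```python
import math

def cosine(q, d):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d))
    return dot / norm

Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

sims = {"D1": cosine(Q, D1), "D2": cosine(Q, D2)}
# Rank the documents by decreasing similarity to the query.
ranking = sorted(sims, key=sims.get, reverse=True)
print(sims)
print(ranking)
```

$D_2$ ranks ahead of $D_1$ because its vector points in nearly the same direction as the query, even though $D_1$ has the single largest weight.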
Other Major Ranking Schemes
Probabilistic Ranking
Attempts to be more theoretically sound
than the vector space (v.s.) model
try to predict the probability of a document’s
being relevant, given the query
there are many many variations
usually more complicated to compute than v.s.
usually many approximations are required
Usually can’t beat v.s. reliably using
standard evaluation measures
Other Major Ranking Schemes
Staged Logistic Regression
A variation on probabilistic ranking
Used successfully here at Berkeley in the
Cheshire II system
Staged Logistic Regression
Pick a set of X feature types
$x_1$: sum of frequencies of all terms in query
$x_2$: sum of frequencies of all query terms in document
$x_3$: query length
$x_4$: document length
$x_5$: sum of idf’s for all terms in query
Determine weights, c, to indicate how important each
feature type is (use training examples)
To assign a score to the document:
add up the feature weight times the term weight for each
feature and each term in the query
$score(D, Q) = \sum_{i=1}^{5} c_i x_i$
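A minimal sketch of the scoring step follows. It covers only the weighted sum on this slide, not the training of the weights. Several details are assumptions made for illustration: query length is counted in distinct terms and document length in tokens (the slide does not say which), and the idf values and the coefficients `c` are invented numbers, not trained ones.

```python
from collections import Counter

def features(query, doc, idf):
    """Compute x1..x5 for one query/document pair (both are token lists)."""
    q_tf, d_tf = Counter(query), Counter(doc)
    q_terms = set(query)
    return [
        sum(q_tf[t] for t in q_terms),          # x1: frequencies of all terms in query
        sum(d_tf[t] for t in q_terms),          # x2: frequencies of query terms in document
        len(q_terms),                           # x3: query length (distinct terms, assumed)
        len(doc),                               # x4: document length (tokens, assumed)
        sum(idf.get(t, 0.0) for t in q_terms),  # x5: sum of idf's for all terms in query
    ]

def score(c, x):
    """score(D, Q) = sum over i of c_i * x_i."""
    return sum(ci * xi for ci, xi in zip(c, x))

idf = {"nova": 1.1, "galaxy": 0.4}   # hypothetical idf values
c = [0.5, 1.0, -0.2, -0.01, 0.3]     # hypothetical weights; real ones come from training
query = ["nova", "galaxy", "nova"]
doc = ["nova", "heat", "galaxy", "nova", "nova"]

x = features(query, doc, idf)
result = score(c, x)
print(x, result)
```

In a logistic regression setting the linear score is typically passed through the logistic function to give a probability of relevance; that step is left out here because the slide shows only the weighted sum.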
Multi-Dimensional Space
Documents exist in multi-dimensional space
What does this mean?
Consider a set of objects with features
In what ways can they be grouped?
different shapes
different sizes
different colors
The features define an abstract space that the objects
can reside in.
Generalize this to terms in documents.
There are more than three kinds of terms!
Text Clustering
Clustering is
“The art of finding groups in data.”
-- Kaufman and Rousseeuw
[Figure: documents plotted as points on two axes, Term 1 and Term 2.]
Pair-wise Document Similarity
     nova  galaxy  heat  h’wood  film  role  diet  fur
A     1      3      1      -      -     -     -     -
B     5      2      -      -      -     -     -     -
C     -      -      -      2      1     5     -     -
D     -      -      -      4      1     -     -     -
How to compute document similarity?
Pair-wise Document Similarity
(no normalization for simplicity)
$D_1 = (w_{11}, w_{12}, \ldots, w_{1t})$
$D_2 = (w_{21}, w_{22}, \ldots, w_{2t})$
$sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \, w_{2i}$

$sim(A, B) = (1 \times 5) + (2 \times 3) = 11$
$sim(A, C) = 0$
$sim(A, D) = 0$
$sim(B, C) = 0$
$sim(B, D) = 0$
$sim(C, D) = (2 \times 4) + (1 \times 1) = 9$
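The pairwise values can be reproduced with a short sketch. The weights below are one reading of the slide's table, with blank cells treated as 0; in particular, which of A and B holds the galaxy weights 3 and 2 is an assumption, though it does not affect the similarity values. The names `TERMS`, `DOCS`, and `sim` are illustrative.

```python
from itertools import combinations

TERMS = ["nova", "galaxy", "heat", "h'wood", "film", "role", "diet", "fur"]

# One weight per term, in the order of TERMS; 0 means the term is absent.
DOCS = {
    "A": [1, 3, 1, 0, 0, 0, 0, 0],
    "B": [5, 2, 0, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 2, 1, 5, 0, 0],
    "D": [0, 0, 0, 4, 1, 0, 0, 0],
}

def sim(d1, d2):
    """Unnormalized inner product of two weight vectors."""
    return sum(w1 * w2 for w1, w2 in zip(d1, d2))

pairwise = {(a, b): sim(DOCS[a], DOCS[b]) for a, b in combinations(DOCS, 2)}
for (a, b), s in pairwise.items():
    print(f"sim({a}, {b}) = {s}")
```

Only pairs that share at least one term get a nonzero score, which is why the astronomy documents (A, B) and the film documents (C, D) pair up with each other and with nothing else.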
Using Clustering
Cluster entire collection
Find cluster centroid that best matches
the query
This has been explored extensively
it is expensive
it doesn’t work well
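For concreteness, matching a query against cluster centroids might look like the sketch below. It assumes the clusters have already been formed by some clustering method, uses the mean vector as the centroid and the cosine as the match score, and all names and data are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of the document vectors in one cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_cluster(query, clusters):
    """Return the name of the cluster whose centroid best matches the query."""
    return max(clusters, key=lambda name: cosine(query, centroid(clusters[name])))

# Hypothetical pre-computed clusters over four terms.
clusters = {
    "astronomy": [[1, 3, 0, 0], [5, 2, 0, 0]],
    "movies": [[0, 0, 2, 1], [0, 0, 4, 1]],
}
query = [1, 1, 0, 0]
match = best_cluster(query, clusters)
print(match)
```

The expense mentioned above comes from clustering the whole collection in the first place; the matching step itself is only one comparison per cluster.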
Using Clustering
Alternative (scatter/gather):
cluster top-ranked documents
show cluster summaries to user
Seems useful
experiments show relevant docs tend to
end up in the same cluster
users seem able to interpret and use the
cluster summaries some of the time
More computationally feasible
Clustering
Advantages:
See some main themes
Disadvantage:
Many ways documents could group
together are hidden
Using Clustering
Another alternative:
cluster entire collection
force results into a 2D space
display graphically to give an overview
looks neat but hasn’t been shown to be useful
Kohonen feature maps can be used instead
of clustering to produce display of
documents in 2D regions
Clustering Multi-Dimensional
Document Space
(image from Wise et al 95)
Concept “Landscapes” from
Kohonen Feature Maps
(X. Lin and H. Chen)
[Figure: concept landscape with regions labeled Disease, Pharmacology, Anatomy, Hospitals, and Legal.]
Graphical Depictions of Clusters
Problems:
Either too many concepts, or too coarse
Only one concept per document
Hard to view titles
Browsing without search
Another Approach to Term
Weighting:
Latent Semantic Indexing
Try to find words that are similar in
meaning to other words by:
computing document by term matrix
a matrix is a two-dimensional array of values
processing the matrix to pull out the main
themes
Document/Term Matrix
      T1    T2    ...   Tt
D1    d11   d12   ...   d1t
D2    d21   d22   ...   d2t
.     .     .           .
.     .     .           .
Dn    dn1   dn2   ...   dnt

$d_{ij}$ = value of $T_j$ in $D_i$
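A small sketch of building such a matrix from tokenized documents is given below. It assumes the "value" $d_{ij}$ is the raw count of term $T_j$ in document $D_i$ (any other term weight could be substituted); the function name and the sample documents are hypothetical.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (terms, rows): one row per document, one column per term."""
    terms = sorted({t for doc in docs for t in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] for t in terms])  # 0 where the term is absent
    return terms, rows

# Hypothetical tokenized documents.
docs = [
    ["nova", "galaxy", "nova"],
    ["film", "role"],
    ["galaxy", "film"],
]
terms, matrix = document_term_matrix(docs)
print(terms)
for row in matrix:
    print(row)
```

Each row is the vector for one document, and each column collects one term's values across all documents.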
Finding Similar Tokens
Using the document/term matrix (rows $D_1, \ldots, D_n$; columns $T_1, \ldots, T_t$), where $d_{ij}$ = value of $T_j$ in $D_i$:
$sim(T_j, T_k) = \sum_{i=1}^{n} d_{ij} \, d_{ik}$
Two terms are considered similar if they co-occur
often in many documents.
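The term-to-term measure is the same inner product as before, taken over columns of the matrix rather than rows. A minimal sketch, with a hypothetical matrix and the illustrative function name `term_similarity`:

```python
def term_similarity(matrix, j, k):
    """sim(T_j, T_k) = sum over documents i of d_ij * d_ik."""
    return sum(row[j] * row[k] for row in matrix)

terms = ["nova", "galaxy", "film"]
# Hypothetical document/term matrix: one row per document, one column per term.
matrix = [
    [1, 3, 0],
    [5, 2, 0],
    [0, 0, 2],
    [0, 1, 4],
]

nova_galaxy = term_similarity(matrix, 0, 1)
nova_film = term_similarity(matrix, 0, 2)
galaxy_film = term_similarity(matrix, 1, 2)
print(nova_galaxy, nova_film, galaxy_film)
```

Here "nova" and "galaxy" co-occur in two documents and score highest, while "nova" and "film" never share a document and score 0.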
Document/Term Matrix
This approach doesn’t work well
Problems:
Word contexts too large
Polysemy
Alternative Approaches
Use Smaller Contexts
Machine-Readable Dictionaries
Local syntactic structure
LSI (Latent Semantic Indexing)
Find main themes within matrix