
Search and Retrieval: More on
Term Weighting and Document Ranking
Prof. Marti Hearst
SIMS 202, Lecture 22
Today

- Document ranking
  - term weights
  - similarity measures
    - vector space model
    - probabilistic models
- Multi-dimensional spaces
  - clustering

Finding Out About

Three phases:
- Asking of a question
- Construction of an answer
- Assessment of the answer

Part of an iterative process.

Ranking Algorithms

1. Assign weights to the terms in the query.
2. Assign weights to the terms in the documents.
3. Compare the weighted query terms to the weighted document terms.
4. Rank order the results.

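A minimal end-to-end sketch of these four steps in Python. The inputs (token lists) and all names here are illustrative; the weighting and similarity choices (tf x idf and an inner product) anticipate the formulas later in this lecture.

import math
from collections import Counter

def rank(query_terms, docs):
    """Rank documents against a query: weight query terms, weight
    document terms, compare them, and rank order the results.
    `docs` maps a document id to its list of tokens."""
    N = len(docs)
    # document frequency n_k: number of docs containing term k
    df = Counter(t for terms in docs.values() for t in set(terms))

    def weights(terms):
        tf = Counter(terms)
        return {t: tf[t] * math.log(N / df[t]) for t in tf if df[t]}

    q = weights(query_terms)
    scored = []
    for doc_id, terms in docs.items():
        d = weights(terms)
        sim = sum(q[t] * d.get(t, 0.0) for t in q)  # inner product
        scored.append((sim, doc_id))
    return sorted(scored, reverse=True)  # best match first
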
[Diagram: the retrieval pipeline. An information need is parsed into a query; the text collection is pre-processed into an index; the Rank step compares the query against the index to produce ranked results.]

Vector Representation
(revisited; see the Salton article in Science)

- Documents and queries are represented as vectors.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t.
- The weight of the term is stored in each position.

D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})
Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})
w = 0 if a term is absent

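A toy sketch of this representation (the vocabulary and weights are invented): each term owns a fixed vector position, and absent terms get weight 0.

vocabulary = ["nova", "galaxy", "heat", "film"]   # term 1 .. term t

def to_vector(weights, vocabulary):
    """Lay a term->weight mapping out along the fixed term positions."""
    return [weights.get(term, 0.0) for term in vocabulary]

D_i = to_vector({"nova": 1.0, "galaxy": 3.0}, vocabulary)
Q   = to_vector({"galaxy": 2.0, "film": 1.0}, vocabulary)
print(D_i)  # [1.0, 3.0, 0.0, 0.0]
print(Q)    # [0.0, 2.0, 0.0, 1.0]
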
Assigning Weights to Terms

- Raw term frequency
- tf x idf
- Automatically derived thesaurus terms

Assigning Weights to Terms

- Raw term frequency
- tf x idf
  - Recall the Zipf distribution
  - Want to weight terms highly if they are
    - frequent in relevant documents ... BUT
    - infrequent in the collection as a whole
- Automatically derived thesaurus terms

Assigning Weights

- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)
- Goal: assign a tf x idf weight to each term in each document

tf x idf

w_{ik} = tf_{ik} \cdot \log(N / n_k)

where:
- T_k = term k in document D_i
- tf_{ik} = frequency of term T_k in document D_i
- idf_k = inverse document frequency of term T_k in collection C
- N = total number of documents in collection C
- n_k = number of documents in C that contain T_k
- idf_k = \log(N / n_k)

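A direct transcription of the formula as a sketch; the variable names mirror the slide and the example numbers are invented.

import math

def tfidf_weight(tf_ik, N, n_k):
    """w_ik = tf_ik * log(N / n_k): high when the term is frequent
    in the document but rare in the collection."""
    return tf_ik * math.log(N / n_k)

# e.g. a term occurring 3 times in a document, in 10 of 1000 docs:
print(tfidf_weight(3, 1000, 10))  # 3 * log(100), about 13.8
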
tf x idf normalization

- Normalize the term weights (so longer documents are not unfairly given more weight).
- Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive.

w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}

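A sketch of the normalization step: divide by the Euclidean length of the document's weight vector, which is what the denominator above computes. The example weights are invented.

import math

def normalize(weights):
    """Divide each tf x idf weight by the Euclidean length of the
    document's weight vector, giving every document unit length."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()} if length else weights

d = {"nova": 13.8, "galaxy": 4.6}
print(normalize(d))  # weights now fall between 0 and 1
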
Vector space similarity
(use the weights to compare the documents)

Now, the similarity of two documents is:

sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}

This is also called the cosine, or normalized inner product.
(Normalization was done when weighting the terms.)

Vector Space Similarity Measure
(combine tf x idf into a similarity measure)

D_i = (w_{d_{i1}}, w_{d_{i2}}, \ldots, w_{d_{it}})
Q = (w_{q_1}, w_{q_2}, \ldots, w_{q_t})
w = 0 if a term is absent

If term weights are normalized:

sim(Q, D_i) = \sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}

Otherwise, normalize in the similarity comparison:

sim(Q, D_i) = \frac{\sum_{j=1}^{t} w_{q_j} \cdot w_{d_{ij}}}{\sqrt{\sum_{j=1}^{t} (w_{q_j})^2 \cdot \sum_{j=1}^{t} (w_{d_{ij}})^2}}

To Think About

How does this ranking algorithm behave?
- Make a set of hypothetical documents consisting of terms and their weights.
- Create some hypothetical queries.
- How are the documents ranked, depending on the weights of their terms and the queries' terms?

Computing Similarity Scores

D_1 = (0.8, 0.3)
D_2 = (0.2, 0.7)
Q = (0.4, 0.8)

cos α_1 = 0.74
cos α_2 = 0.98

[Figure: D_1, D_2, and Q plotted as vectors in the unit square; α_1 is the angle between Q and D_1, α_2 the angle between Q and D_2.]

Computing a similarity score

Say we have query vector Q = (0.4, 0.8) and document D_2 = (0.2, 0.7). What does their similarity comparison yield?

sim(Q, D_2) = \frac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \frac{0.64}{\sqrt{0.424}} \approx 0.98

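Checking the arithmetic with a couple of lines of Python:

import math

q, d2 = (0.4, 0.8), (0.2, 0.7)
dot = sum(a * b for a, b in zip(q, d2))                           # 0.64
norm = math.sqrt(sum(a * a for a in q) * sum(b * b for b in d2))  # sqrt(0.424)
print(round(dot / norm, 2))  # 0.98
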
Other Major Ranking Schemes

- Probabilistic Ranking
  - Attempts to be more theoretically sound than the vector space (v.s.) model
    - tries to predict the probability of a document's being relevant, given the query
    - there are many, many variations
    - usually more complicated to compute than v.s.
    - usually many approximations are required
  - Usually can't beat v.s. reliably, using standard evaluation measures

Other Major Ranking Schemes

- Staged Logistic Regression
  - A variation on probabilistic ranking
  - Used successfully here at Berkeley in the Cheshire II system

Staged Logistic Regression

- Pick a set of feature types (five here):
  - x1 = sum of frequencies of all terms in the query
  - x2 = sum of frequencies of all query terms in the document
  - x3 = query length
  - x4 = document length
  - x5 = sum of idfs for all terms in the query
- Determine weights, c, to indicate how important each feature type is (use training examples).
- To assign a score to the document, add up each feature's weight times its value (a sketch of this step follows below):

score(D, Q) = \sum_{i=1}^{5} c_i x_i

Multi-Dimensional Space

- Documents exist in multi-dimensional space
- What does this mean?
  - Consider a set of objects with features:
    - different shapes
    - different sizes
    - different colors
  - In what ways can they be grouped?
  - The features define an abstract space that the objects can reside in.
- Generalize this to terms in documents.
  - There are more than three kinds of terms!

Text Clustering

Clustering is
"The art of finding groups in data."
-- Kaufman and Rousseeuw

[Figure: documents plotted in the space of Term 1 vs. Term 2, forming visible groups.]

Pair-wise Document Similarity

        nova  galaxy  heat  h'wood  film  role  diet  fur
   A     1      3      1
   B     5      2
   C                           2      1     5
   D                           4      1                  2

How to compute document similarity?

Pair-wise Document Similarity
(no normalization, for simplicity)

D_1 = (w_{11}, w_{12}, \ldots, w_{1t})
D_2 = (w_{21}, w_{22}, \ldots, w_{2t})

sim(D_1, D_2) = \sum_{i=1}^{t} w_{1i} \cdot w_{2i}

sim(A, B) = (1 x 5) + (2 x 3) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2 x 4) + (1 x 1) = 9

        nova  galaxy  heat  h'wood  film  role  diet  fur
   A     1      3      1
   B     5      2
   C                           2      1     5
   D                           4      1                  2

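A sketch reproducing these numbers with the inner-product similarity, storing each document as a term-to-weight dict:

docs = {
    "A": {"nova": 1, "galaxy": 3, "heat": 1},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"h'wood": 2, "film": 1, "role": 5},
    "D": {"h'wood": 4, "film": 1, "fur": 2},
}

def sim(d1, d2):
    """Inner product over shared terms (no normalization)."""
    return sum(w * d2[t] for t, w in d1.items() if t in d2)

print(sim(docs["A"], docs["B"]))  # 11
print(sim(docs["C"], docs["D"]))  # 9
print(sim(docs["A"], docs["C"]))  # 0 (no shared terms)
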
Using Clustering

- Cluster the entire collection (a sketch of the matching step follows below)
- Find the cluster centroid that best matches the query
- This has been explored extensively
  - it is expensive
  - it doesn't work well

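A toy numpy sketch of the centroid-matching idea; the tiny vectors and the precomputed cluster labels are invented for illustration (any clustering algorithm could supply the labels):

import numpy as np

def best_cluster(query_vec, doc_vecs, labels):
    """Return the id of the cluster whose centroid has the highest
    inner product with the query vector."""
    centroids = {c: doc_vecs[labels == c].mean(axis=0)
                 for c in np.unique(labels)}
    return max(centroids, key=lambda c: centroids[c] @ query_vec)

# 4 tiny documents in a 3-term space, pre-assigned to 2 clusters:
docs = np.array([[1., 3., 0.], [5., 2., 0.], [0., 0., 2.], [0., 1., 4.]])
labels = np.array([0, 0, 1, 1])
print(best_cluster(np.array([0., 1., 3.]), docs, labels))  # 1
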

Using Clustering

- Alternative (Scatter/Gather):
  - cluster the top-ranked documents
  - show cluster summaries to the user
- Seems useful
  - experiments show relevant docs tend to end up in the same cluster
  - users seem able to interpret and use the cluster summaries some of the time
- More computationally feasible

Clustering

- Advantage:
  - See some main themes
- Disadvantage:
  - Many ways documents could group together are hidden

Using Clustering

- Another alternative:
  - cluster the entire collection
  - force the results into a 2D space
  - display graphically to give an overview
    - looks neat, but hasn't been shown to be useful
- Kohonen feature maps can be used instead of clustering to produce a display of documents in 2D regions

Clustering Multi-Dimensional Document Space

[Image from Wise et al. 95: visualization of a clustered document space]

Concept “Landscapes” from Kohonen Feature Maps
(X. Lin and H. Chen)

[Image: map regions labeled Disease, Pharmacology, Anatomy, Hospitals, Legal]

Graphical Depictions of Clusters

- Problems:
  - Either too many concepts, or too coarse
  - Only one concept per document
  - Hard to view titles
  - Browsing without search

Another Approach to Term Weighting:
Latent Semantic Indexing

- Try to find words that are similar in meaning to other words by:
  - computing a document-by-term matrix
    - a matrix is a two-dimensional array
  - processing the matrix to pull out the main themes

Document/Term Matrix

          T_1    T_2    ...   T_t
   D_1    d_11   d_12   ...   d_1t
   D_2    d_21   d_22   ...   d_2t
   ...    ...    ...    ...   ...
   D_n    d_n1   d_n2   ...   d_nt

d_ij = value of T_j in D_i

Finding Similar Tokens

          T_1    T_2    ...   T_t
   D_1    d_11   d_12   ...   d_1t
   D_2    d_21   d_22   ...   d_2t
   ...    ...    ...    ...   ...
   D_n    d_n1   d_n2   ...   d_nt

d_ij = value of T_j in D_i

sim(T_j, T_k) = \sum_{i=1}^{n} d_{ij} \cdot d_{ik}

Two terms are considered similar if they co-occur often in many documents.

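A small sketch of this column-wise comparison; the matrix values are invented:

import numpy as np

# rows = documents D_1..D_n, columns = terms T_1..T_t
M = np.array([[1., 3., 0.],
              [5., 2., 0.],
              [0., 1., 4.]])

def term_sim(M, j, k):
    """sim(T_j, T_k): sum over documents of d_ij * d_ik."""
    return float(M[:, j] @ M[:, k])

print(term_sim(M, 0, 1))  # 1*3 + 5*2 + 0*1 = 13
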
Document/Term Matrix

- This approach doesn't work well
- Problems:
  - Word contexts too large
  - Polysemy
- Alternative approaches:
  - Use smaller contexts
    - Machine-readable dictionaries
    - Local syntactic structure
  - LSI (Latent Semantic Indexing)
    - Find main themes within the matrix

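A minimal sketch of the theme-finding step, assuming (as is standard for LSI, though the slide does not name it) that the themes are extracted with a truncated singular value decomposition; the matrix and the choice k = 2 are invented:

import numpy as np

# document/term matrix (rows: documents, columns: terms)
M = np.array([[1., 3., 0., 0.],
              [5., 2., 0., 0.],
              [0., 0., 2., 1.],
              [0., 0., 4., 1.]])

# keep only the k strongest "themes" (singular directions)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation

print(np.round(M_k, 2))  # documents re-expressed via 2 latent themes
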