Indexing and Representation: The Vector Space Model


A document is represented by a vector of terms:
- Words (or word stems)
- Phrases (e.g., "computer science")
- Words on a "stop list" are removed
  - Documents aren't "about" words like "the"
Terms are often assumed to be uncorrelated; correlations between term vectors imply a similarity between the corresponding documents.
For efficiency, an inverted index of terms is often stored (see the sketch below).
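
As a rough illustration of the inverted-index idea, the sketch below (a minimal Python example with hypothetical toy documents) maps each term to the set of document ids that contain it, which is what makes term lookup efficient at query time.

```python
from collections import defaultdict

# Toy documents (hypothetical); in practice these would already be
# tokenized, stemmed, and stripped of stop words.
docs = {
    "d1": ["nova", "galaxy", "heat"],
    "d2": ["hollywood", "film", "role"],
    "d3": ["film", "role", "diet"],
}

# Build the inverted index: term -> set of document ids containing it.
inverted_index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index[term].add(doc_id)

# A query only needs to touch the postings for its own terms,
# not every document in the collection.
query = ["film", "role"]
candidates = set.union(*(inverted_index[t] for t in query if t in inverted_index))
print(sorted(candidates))   # ['d2', 'd3']
```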
Document Representation
What values to use for terms?
- Boolean (term present/absent)
- tf (term frequency): count of times the term occurs in the document.
  - The more times a term t occurs in document d, the more likely it is that t is relevant to the document.
  - Used alone, it favors common words and long documents.
- df (document frequency): the number of documents containing the term.
  - The more a term t occurs throughout all documents, the more poorly t discriminates between documents.
- tf-idf (term frequency * inverse document frequency): a high value indicates that the word occurs more often in this document than average (sketched below).
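
To make these weighting options concrete, here is a minimal Python sketch (toy collection, no stemming or stop-word removal) that computes raw tf, df, and a tf-idf weight of the form tf * log(N/df) for the terms of one document.

```python
import math
from collections import Counter

# Hypothetical toy collection of already-tokenized documents.
collection = [
    ["nova", "galaxy", "heat", "nova"],
    ["film", "role", "film"],
    ["galaxy", "film", "diet"],
]
N = len(collection)

# df: number of documents containing each term.
df = Counter()
for doc in collection:
    df.update(set(doc))

def tf_idf(doc):
    """Return {term: tf * log(N / df)} for one document."""
    tf = Counter(doc)   # term frequency: raw counts within this document
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tf_idf(collection[0]))
# 'nova' gets a high weight (frequent here, rare elsewhere);
# 'galaxy' gets a lower weight (it also appears in another document).
```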
Vector Representation
- Documents and queries are represented as vectors.
- Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t.

  $D_i = (w_{d_i1}, w_{d_i2}, \ldots, w_{d_it})$
  $Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$

- $w = 0$ if a term is absent.
Document Vectors
[Table: example document vectors. Rows are document ids A-I; columns are term weights (0.1-1.0) for the terms nova, galaxy, heat, h'wood, film, role, diet, and fur; each document has nonzero weights for only a few of the terms.]
Assigning Weights
- We want to weight terms highly if they are:
  - frequent in relevant documents ... BUT
  - infrequent in the collection as a whole
Assigning Weights
- tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)
- Notation:
  - $T_k$ = term $k$ in document $D_i$
  - $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  - $idf_k$ = inverse document frequency of term $T_k$ in collection $C$
  - $N$ = total number of documents in the collection $C$
  - $n_k$ = the number of documents in $C$ that contain $T_k$
  - $idf_k = \log(N / n_k)$
tf x idf
- Normalize the term weights (so longer documents are not unfairly given more weight):

  $w_{ik} = \dfrac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}}$

- Now:

  $\mathrm{sim}(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk}$
tf x idf normalization
- Normalize the term weights (so longer documents are not unfairly given more weight):
  - "Normalize" usually means forcing all values to fall within a certain range, usually between 0 and 1, inclusive.

  $w_{ik} = \dfrac{tf_{ik}\,\log(N/n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2\,[\log(N/n_k)]^2}}$
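
A minimal sketch of the normalized weight $w_{ik}$ above, assuming the tf counts and document frequencies $n_k$ are already available; dividing by the vector's Euclidean length keeps long documents from being favored.

```python
import math

def normalized_weights(tf, n, N):
    """tf: {term: frequency in this document}
       n:  {term: number of documents containing the term}
       N:  total number of documents in the collection.
       Returns {term: w_ik} normalized to unit length."""
    raw = {t: tf[t] * math.log(N / n[t]) for t in tf}        # tf * idf
    norm = math.sqrt(sum(w * w for w in raw.values()))       # vector length
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# Hypothetical counts for one document in a 1000-document collection.
w = normalized_weights({"nova": 3, "galaxy": 1}, {"nova": 5, "galaxy": 200}, 1000)
print(w, sum(v * v for v in w.values()))   # weights and squared length (~1.0)
```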
Vector Space Similarity Measure
Combine tf x idf into a similarity measure:

  $D_i = (w_{d_i1}, w_{d_i2}, \ldots, w_{d_it})$
  $Q = (w_{q1}, w_{q2}, \ldots, w_{qt})$
  $w = 0$ if a term is absent

Unnormalized similarity:

  $\mathrm{sim}(Q, D_i) = \sum_{j=1}^{t} w_{qj} \cdot w_{d_ij}$

Cosine:

  $\mathrm{sim}(Q, D_i) = \dfrac{\sum_{j=1}^{t} w_{qj}\, w_{d_ij}}{\sqrt{\sum_{j=1}^{t} (w_{qj})^2}\;\sqrt{\sum_{j=1}^{t} (w_{d_ij})^2}}$

(The cosine is a normalized inner product.)
Computing Similarity Scores
In a two-term space:

  $D_1 = (0.8, 0.3)$
  $D_2 = (0.2, 0.7)$
  $Q = (0.4, 0.8)$

  $\cos \alpha_1 = 0.74$
  $\cos \alpha_2 = 0.98$

[Figure: Q, D_1, and D_2 drawn as vectors in the unit square (axis ticks at 0.2, 0.4, 0.6, 0.8, 1.0); $\alpha_1$ is the angle between Q and $D_1$, and $\alpha_2$ the angle between Q and $D_2$.]
Documents in Vector Space
[Figure: documents D1-D11 plotted in a three-dimensional term space with axes t1, t2, and t3.]
Computing a similarity score
Say we have query vector $Q = (0.4, 0.8)$ and document $D_2 = (0.2, 0.7)$. What does their similarity comparison yield?

  $\mathrm{sim}(Q, D_2) = \dfrac{(0.4)(0.2) + (0.8)(0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \cdot [(0.2)^2 + (0.7)^2]}} = \dfrac{0.64}{\sqrt{0.42}} \approx 0.98$
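
The arithmetic above can be checked in a couple of lines of Python:

```python
import math

num = 0.4 * 0.2 + 0.8 * 0.7                              # 0.64
den = math.sqrt((0.4**2 + 0.8**2) * (0.2**2 + 0.7**2))   # sqrt(0.424) ~ 0.65
print(round(num / den, 2))                               # 0.98
```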
Similarity Measures
- Simple matching (coordination level match): $|Q \cap D|$
- Dice's Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
- Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
- Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$
- Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$
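
For these set-based measures, Q and D are treated simply as sets of terms; a minimal Python sketch with hypothetical example sets:

```python
def coefficients(q, d):
    """Set-based similarity coefficients for query term set q and document term set d."""
    inter = len(q & d)
    return {
        "simple_matching": inter,                            # |Q n D|
        "dice":    2 * inter / (len(q) + len(d)),            # 2|Q n D| / (|Q| + |D|)
        "jaccard": inter / len(q | d),                       # |Q n D| / |Q u D|
        "cosine":  inter / (len(q) ** 0.5 * len(d) ** 0.5),  # |Q n D| / (|Q|^1/2 |D|^1/2)
        "overlap": inter / min(len(q), len(d)),              # |Q n D| / min(|Q|, |D|)
    }

print(coefficients({"nova", "galaxy", "heat"}, {"galaxy", "heat", "film", "role"}))
```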
Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - it is more for visualization than having any real basis
  - most similarity measures work about the same regardless of model
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms
Probabilistic Models
- A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (the Probability Ranking Principle)
- Relies on accurate estimates of probabilities for accurate results
Probabilistic Retrieval
- Goes back to the 1960s (Maron and Kuhns)
- Robertson's "Probabilistic Ranking Principle":
  - Retrieved documents should be ranked in decreasing order of the probability that they are relevant to the user's query.
  - How to estimate these probabilities?
    - There are several methods (Model 1, Model 2, Model 3) with different emphases on how the estimates are done.
Probabilistic Models: Some Notation
- D = all present and future documents
- Q = all present and future queries
- $(D_i, Q_j)$ = a document-query pair
- x = a class of similar documents, $x \subseteq D$
- y = a class of similar queries, $y \subseteq Q$
- Relevance is a relation:

  $R = \{(D_i, Q_j) \mid D_i \in D,\ Q_j \in Q,\ \text{document } D_i \text{ is judged relevant by the user submitting } Q_j\}$
Probabilistic Models: Logistic Regression
The probability of relevance is based on logistic regression over a sample set of documents, which is used to determine the values of the coefficients. At retrieval time the probability estimate is obtained from:

  $P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$

for the six attribute measures $X_1 \ldots X_6$ shown next.
Probabilistic Models: Logistic Regression Attributes
  $X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$ -- Average Absolute Query Frequency
  $X_2 = QL$ -- Query Length
  $X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$ -- Average Absolute Document Frequency
  $X_4 = DL$ -- Document Length
  $X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}$, where $IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$ -- Average Inverse Document Frequency
  $X_6 = \log M$ -- Number of terms in common between query and document, logged
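
As a rough illustration, here is a Python sketch that computes the six attributes (loosely following the definitions above, with QAF/DAF taken as a term's frequency in the query/document and M as the number of terms the query and document share) and combines them with hypothetical coefficients. In practice the coefficients come from fitting a logistic regression on judged documents, and the linear combination is usually read as a log-odds and mapped to a probability with the logistic function.

```python
import math

def features(query_terms, doc_terms, query_tf, doc_tf, N, df):
    """The six attributes, for the M terms shared by query and document.
       query_tf / doc_tf: term -> frequency; N: collection size; df: term -> document frequency."""
    common = [t for t in query_terms if t in doc_terms]
    M = len(common)
    if M == 0:
        return None
    x1 = sum(math.log(query_tf[t]) for t in common) / M         # avg absolute query frequency (logged)
    x2 = len(query_terms)                                        # query length
    x3 = sum(math.log(doc_tf[t]) for t in common) / M            # avg absolute document frequency (logged)
    x4 = len(doc_terms)                                          # document length
    x5 = sum(math.log((N - df[t]) / df[t]) for t in common) / M  # avg inverse document frequency
    x6 = math.log(M)                                             # number of common terms, logged
    return [x1, x2, x3, x4, x5, x6]

def relevance_score(xs, coeffs):
    """coeffs = [c0, c1, ..., c6] (hypothetical); maps the linear combination
       through the logistic function to get an estimated probability of relevance."""
    z = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], xs))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical query/document statistics and made-up coefficients.
xs = features({"amtrak", "financing"}, {"amtrak", "financing", "rail", "subsidy"},
              {"amtrak": 1, "financing": 1},
              {"amtrak": 3, "financing": 2, "rail": 5, "subsidy": 1},
              N=10000, df={"amtrak": 12, "financing": 800})
print(relevance_score(xs, [-3.5, 0.3, -0.1, 0.4, -0.01, 0.5, 0.6]))
```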
Probabilistic Models
Advantages:
- Strong theoretical basis
- In principle should supply the best predictions of relevance given the available information
- Can be implemented similarly to the Vector model

Disadvantages:
- Relevance information is required -- or must be "guestimated"
- Important indicators of relevance may not be terms -- though usually only terms are used
- Optimally requires ongoing collection of relevance information
Vector and Probabilistic Models
- Support "natural language" queries
- Treat documents and queries the same
- Support relevance feedback searching
- Support ranked retrieval
- Differ primarily in theoretical basis and in how the ranking is calculated
  - Vector assumes relevance
  - Probabilistic relies on relevance judgments or estimates
Simple Presentation of Results
- Order by similarity
  - Decreasing order of presumed relevance
  - Items retrieved early in the search may help generate feedback via relevance feedback
- Select the top k documents
- Select documents within a threshold ε of the query
(A small sketch of the last two options follows below.)
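
A small sketch of the two selection rules, given hypothetical (document id, similarity) pairs:

```python
scores = [("d3", 0.91), ("d7", 0.62), ("d1", 0.40), ("d9", 0.12)]  # hypothetical similarities
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)     # decreasing presumed relevance

top_k = ranked[:2]                                  # select the top k documents
within = [p for p in ranked if p[1] >= 0.5]         # select documents within a threshold of the query
print(top_k, within)
```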

Evaluation
- Relevance
- Evaluation of IR systems
  - Precision vs. Recall
  - Cutoff points
  - Test collections / TREC
  - Blair & Maron study
What to Evaluate?
- How much is learned about the collection?
- How much is learned about a topic?
- How much of the information need is satisfied?
- How inviting is the system?
What to Evaluate?
What can be measured that reflects users' ability to use the system? (Cleverdon 66)
- Coverage of information
- Form of presentation
- Effort required / ease of use
- Time and space efficiency
- Effectiveness:
  - Recall: proportion of relevant material actually retrieved
  - Precision: proportion of retrieved material actually relevant
Relevance
In what ways can a document be relevant to a query?
- Answer a precise question precisely.
- Partially answer the question.
- Suggest a source for more information.
- Give background information.
- Remind the user of other knowledge.
- Others ...
Standard IR Evaluation
- Precision = (# relevant retrieved) / (# retrieved)
- Recall = (# relevant retrieved) / (# relevant in collection)
[Figure: the retrieved documents shown as a subset of the whole collection; precision and recall compare the retrieved set with the set of relevant documents.]
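
A minimal Python sketch of both measures, using hypothetical sets of document ids:

```python
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of document ids."""
    hits = len(retrieved & relevant)                     # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}   # what the system returned (hypothetical)
relevant  = {"d2", "d4", "d7", "d9"}   # what the user judged relevant (hypothetical)
print(precision_recall(retrieved, relevant))             # (0.5, 0.5)
```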
Precision/Recall Curves
- There is a tradeoff between precision and recall
- So measure precision at different levels of recall
[Figure: precision (y-axis) plotted against recall (x-axis), with measured points marked x.]
Precision/Recall Curves
- It is difficult to determine which of these two hypothetical results is better:
[Figure: two hypothetical precision/recall curves, with measured points marked x.]
Document Cutoff Levels
Another way to evaluate:
- Fix the number of documents retrieved at several levels:
  - top 5, top 10, top 20, top 50, top 100, top 500
- Measure precision at each of these levels
- Take a (weighted) average over the results
This is a way to focus on high precision (a small sketch follows below).
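
A sketch of precision at fixed document cutoff levels, assuming a ranked list of document ids and a set of relevant ids (both hypothetical):

```python
def precision_at_cutoffs(ranked, relevant, cutoffs=(5, 10, 20, 50, 100, 500)):
    """Precision within the top-k results for each cutoff k."""
    scores = {}
    for k in cutoffs:
        top_k = ranked[:k]
        scores[k] = sum(1 for d in top_k if d in relevant) / len(top_k) if top_k else 0.0
    return scores

ranked = ["d7", "d2", "d9", "d5", "d1", "d4", "d8", "d3", "d6", "d0"]  # ranked output
relevant = {"d2", "d4", "d7"}
print(precision_at_cutoffs(ranked, relevant, cutoffs=(5, 10)))   # {5: 0.4, 10: 0.3}
```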
The E-Measure
Combines precision and recall into one number (van Rijsbergen 79):

  $E = 1 - \dfrac{(b^2 + 1)\,P\,R}{b^2 P + R}$

  P = precision
  R = recall
  b = measure of the relative importance of P or R

For example, b = 0.5 means the user is twice as interested in precision as recall.
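
The E-measure is easy to compute once precision and recall are known; a minimal sketch with hypothetical values:

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E-measure; b < 1 emphasizes precision, b > 1 emphasizes recall."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1.0 - ((b * b + 1) * precision * recall) / (b * b * precision + recall)

print(e_measure(0.5, 0.25, b=0.5))   # user cares twice as much about precision
print(e_measure(0.5, 0.25, b=2.0))   # user cares twice as much about recall
```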
TREC
- Text REtrieval Conference/Competition
  - Run by NIST (National Institute of Standards & Technology)
  - 1997 was the 6th year
- Collection: 3 gigabytes, >1 million documents
  - Newswire & full-text news (AP, WSJ, Ziff)
  - Government documents (Federal Register)
- Queries + relevance judgments
  - Queries devised and judged by "Information Specialists"
  - Relevance judgments done only for those documents retrieved -- not the entire collection!
- Competition
  - Various research and commercial groups compete
  - Results judged on precision and recall, going up to a recall level of 1000 documents
Sample TREC queries (topics)
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description:
A document will address the role of the Federal Government in
financing the operation of the National Railroad Transportation
Corporation (AMTRAK)
<narr> Narrative: A relevant document must provide
information on the government’s responsibility to make
AMTRAK an economically viable entity. It could also discuss
the privatization of AMTRAK as an alternative to continuing
government subsidies. Documents comparing government
subsidies given to air and bus transportation with those
provided to AMTRAK would also be relevant.
TREC
- Benefits:
  - made research systems scale to large collections (pre-WWW)
  - allows for somewhat controlled comparisons
- Drawbacks:
  - emphasis on high recall, which may be unrealistic for what most users want
  - very long queries, also unrealistic
  - comparisons still difficult to make, because systems are quite different on many dimensions
  - focus on batch ranking rather than interaction
  - no focus on the WWW
TREC Results
- Results differ each year
- For the main track:
  - The best systems are not statistically significantly different
  - Small differences sometimes have big effects
    - how good was the hyphenation model
    - how was document length taken into account
  - Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
- The excitement is in the new tracks:
  - Interactive
  - Multilingual
  - NLP
Blair and Maron 1985
- A highly influential paper
- A classic study of retrieval effectiveness
  - earlier studies used unrealistically small collections
- Studied an archive of documents for a legal suit
  - ~350,000 pages of text
  - 40 queries
  - focus on high recall
- Used IBM's STAIRS full-text system
- Main result: the system retrieved less than 20% of the relevant documents for a particular information need, while the lawyers thought they had found 75%
  - But many queries had very high precision
Blair and Maron, cont.
- Why recall was low:
  - users can't foresee the exact words and phrases that will indicate relevant documents
    - "accident" was referred to by those responsible as "event," "incident," "situation," "problem," ...
  - differing technical terminology
  - slang, misspellings
- Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied