pptx

advertisement
Information Retrieval
and Web Search
IR models: Vector Space Model
Instructor: Rada Mihalcea
[Note: Some slides in this set were adapted from an IR course taught by Ray Mooney at UT
Austin (who in turn adapted them from Joydeep Ghosh), and from an IR course taught by Chris
Manning at Stanford)
Ch. 6
Ranked Retrieval
• Thus far, our queries have all been Boolean
– Documents either match or don’t
• Good for expert users with precise understanding of
their needs and the collection
– Also good for applications: Applications can easily consume
1000s of results
• Not good for the majority of users
– Most users incapable of writing Boolean queries (or they are, but
they think it’s too much work)
– Most users don’t want to wade through 1000s of results
•
This is particularly true of Web search
Ch. 6
Problem with Boolean Search
• Boolean queries often result in either too few (=0) or too
many (1000s) results.
• Query 1: “standard user dlink 650” → 200,000 hits
• Query 2: “standard user dlink 650 no card found”: 0 hits
• It takes a lot of skill to come up with a query that
produces a manageable number of hits.
– AND gives too few; OR gives too many
Ranked Retrieval Models
• Rather than a set of documents satisfying a query
expression, in ranked retrieval, the system returns an
ordering over the (top) documents in the collection for a
query
• Free text queries: Rather than a query language of
operators and expressions, the user’s query is just one
or more words in a human language
• In principle, there are two separate choices here, but in
practice, ranked retrieval has normally been associated
with free text queries and vice versa
3
Ch. 6
Not a Problem in Ranked Retrieval
• When a system produces a ranked result set, large
result sets are not an issue
– Indeed, the size of the result set is not an issue
– We just show the top k ( ≈ 10) results
– We don’t overwhelm the user
– Premise: the ranking algorithm works
Ch. 6
Scoring as the Basis of Ranked Retrieval
• We wish to return in order the documents most likely to
be useful to the searcher
• How can we rank-order the documents in the collection
with respect to a query?
• Assign a score – say in [0, 1] – to each document
• This score measures how well document and query
“match”.
Query-document Matching Scores
• We need a way of assigning a score to a
query/document pair
• Let’s start with a one-term query
• If the query term does not occur in the document: score
should be 0
• The more frequent the query term in the document, the
higher the score (should be)
Ch. 6
Take 1: Jaccard coefficient
• Jaccard: A commonly used measure of overlap of two
sets A and B
• Jaccard(A,B) = |A ∩ B| / |A ∪ B|
• Jaccard(A,A) = 1
• Jaccard(A,B) = 0 if A ∩ B = 0
• A and B don’t have to be the same size.
• Always assigns a number between 0 and 1.
Ch. 6
Exercise: Jaccard Coefficient
• What is the query-document match score that the
Jaccard coefficient computes for each of the two
documents below?
• Query: march of dimes
• Document 1: caesar died in march
• Document 2: the long march
Ch. 6
Issues with Jaccard for Scoring
• It does not consider term frequency (how many times a
term occurs in a document)
• Rare terms in a collection are more informative than
frequent terms. Jaccard does not consider this
information
• We need a more sophisticated way of normalizing for
length
Sec. 6.2
Recall (from Boolean Retrieval): Binary
Term-Document Incidence Matrix
Antony
Brutus
Caesar
Calpurnia
Cleopatra
mercy
worser
Antony and Cleopatra
Julius Caesar
The Tempest
Hamlet
Othello
Macbeth
1
1
1
0
1
1
1
1
1
1
1
0
0
0
0
0
0
0
0
1
1
0
1
1
0
0
1
1
0
0
1
0
0
1
1
1
Each document is represented by a binary vector ∈ {0,1}|V|
0
1
0
0
1
0
Sec. 6.2
Term-document Count Matrices
• Consider the number of occurrences of a term in a
document:
– Each document is a count vector: a column below
Antony and Cleopatra
Julius Caesar
The Tempest
Hamlet
Othello
Macbeth
Antony
157
73
0
0
0
0
Brutus
4
157
0
1
0
0
Caesar
232
227
0
2
1
1
Calpurnia
0
10
0
0
0
0
Cleopatra
57
0
0
0
0
0
mercy
2
0
3
5
5
1
worser
2
0
1
1
1
0
Vector-Space Model
• t distinct terms remain after preprocessing
– Unique terms that form the VOCABULARY
• These “orthogonal” terms form a vector space.
Dimension = t = |vocabulary|
– 2 terms  bi-dimensional; …; t-terms  t-dimensional
• Each term, i, in a document or query j, is given a realvalued weight, wij.
• Both documents and queries are expressed as tdimensional vectors:
dj = (w1j, w2j, …, wtj)
Vector-Space Model
Query as vector:
• We regard query as short document
• We return the documents ranked by the closeness of
their vectors to the query, also represented as a vector.
• Vector-space model was developed in the SMART
system (Salton, c. 1970) and standardly used by TREC
participants and web IR systems
Graphic Representation
Example:
T3
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
5
Q = 0T1 + 0T2 + 2T3
D1 = 2T1+ 3T2 + 5T3
Q = 0T1 + 0T2 + 2T3
2
3
T1
D2 = 3T1 + 7T2 + T3
T2
7
• Is D1 or D2 more similar to Q?
• How to measure the degree
of similarity? Distance?
Angle? Projection?
Document Collection Representation
• A collection of n documents can be represented in the
vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a
term in the document; zero means the term has no
significance in the document or it simply doesn’t exist in
the document.
T1 T2 ….
Tt
D1 w11 w21 …
wt1
D2 w12 w22 …
wt2
:
:
:
:
:
:
:
:
Dn w1n w2n …
wtn
Term Frequency tf
• The term frequency tfij of term i in document j is defined as
the number of times that term i occurs in document j.
• More frequent terms in a document are more important,
i.e. more indicative of the topic.
• May want to normalize term frequency (tf) :
tfij = freqij / max{fij}
• We want to use tf when computing query-document match
scores. But how?
Term Frequency tf
• Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more relevant
than a document with 1 occurrence of the term.
– But not 10 times more relevant.
• Relevance does not increase proportionally with term
frequency.
Sec. 6.2.1
Document Frequency
• Rare terms are more informative than frequent terms
– Recall stop words
• Consider a term in the query that is rare in the collection
(e.g., arachnocentric)
• A document containing this term is very likely to be relevant
to the query arachnocentric
• → We want a high weight for rare terms like arachnocentric.
Sec. 6.2.1
Document Frequency (continued)
• Frequent terms are less informative than rare terms
• Consider a query term that is frequent in the collection
(e.g., high, increase, line)
• A document containing such a term is more likely to be
relevant than a document that doesn’t
• But it’s not a sure indicator of relevance.
• → For frequent terms, we want high positive weights for
words like high, increase, and line
• But lower weights than for rare terms.
• We will use document frequency (df) to capture this.
Sec. 6.2.1
Idf Weight
• dfi is the document frequency of term i: the number of
documents that contain term i
– dfi is an inverse measure of the informativeness of term i
– dfi  n, where n is the number of documents in the collection
• We define the idf (inverse document frequency) of term i
by
idfi = log10 (N /dft )
– We use log (N/dfi) instead of N/dfi to “dampen” the effect of a
high df (as compared to tf)
Sec. 6.2.1
idf example, suppose N = 1 million
term
calpurnia
dft
idft
1
animal
100
sunday
1,000
fly
10,000
under
the
100,000
1,000,000
idfi = log10 (N /dft )
There is one idf value for each term i in a collection.
Sec. 6.2.1
Collection vs. Document Frequency
• The collection frequency of i is the number of
occurrences of i in the collection, counting multiple
occurrences.
• Example:
Word
Collection frequency
Document frequency
insurance
10440
3997
try
10422
8760
• Which word is a better search term (and should get a
higher weight)?
Sec. 6.2.2
tf-idf Weighting
• The tf-idf weight of a term is the product of its tf weight and
its idf weight.
wij = tfij ´ log10 (N / dfi )
• Best known weighting scheme in information retrieval
– Theoretically proven to work well (Papineni, NAACL 2001)
– Note: the “-” in tf-idf is a hyphen, not a minus sign!
– Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a
document
• Increases with the rarity of the term in the collection
Computing tf-idf: An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log(10000/50) = 5.3;
tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
Sec. 6.3
Binary → Count → Weight matrix
Antony
Brutus
Caesar
Calpurnia
Cleopatra
mercy
worser
Antony and Cleopatra
Julius Caesar
The Tempest
Hamlet
Othello
Macbeth
5.25
1.21
8.59
0
2.85
1.51
1.37
3.18
6.1
2.54
1.54
0
0
0
0
0
0
0
0
1.9
0.11
0
1
1.51
0
0
0.12
4.15
0
0
0.25
0
0
5.25
0.25
0.35
Each document is now represented by a real-valued
vector of tf-idf weights ∈ R|V|
0
0
0
0
0.88
1.95
Sec. 6.3
Documents as Vectors
• So we have a |V|-dimensional vector space
• Terms are axes of the space
• Documents are points or vectors in this space
• Very high-dimensional: tens of millions of dimensions
when you apply this to a web search engine
• These are very sparse vectors - most entries are zero.
Query as Vector
• Query vector is typically treated as a document and also
tf-idf weighted.
• Alternative is for the user to supply weights for the given
query terms.
Similarity Measure
• We now have vectors for all documents in the
collection, a vector for the query, how to compute
similarity?
• A similarity measure is a function that computes the
degree of similarity between two vectors.
• Using a similarity measure between the query and
each document:
– It is possible to rank the retrieved documents in the order of
presumed relevance.
– It is possible to enforce a certain threshold so that the size of the
retrieved set can be controlled.
First Cut: Euclidean Distance
• Distance between vectors d1 and d2 is the length of the
vector |d1 – d2|.
– Euclidean distance
• Exercise: Determine the Euclidean distance between
the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0)
• Why is this not a great idea?
• We still haven’t dealt with the issue of length
normalization
– Long documents would be more similar to each other by virtue of
length, not topic
Second Cut: Manhattan Distance
• Or “city block” measure
– Based on the idea that generally in American cities you cannot
follow a direct line between two points.
y
x
• Uses the formula:
n
ManhDist ( X , Y )   | xi  yi |
i 1
• Exercise: Determine the Manhattan distance between
the vectors (0, 3, 2, 1, 10) and (2, 7, 1, 0, 0)
Third Cut: Inner Product
• Similarity between vectors for the documents d1 and
document d2 can be computed as the vector inner
product:
sim(d1, d2 ) = d1 × d2 = å wi1wi2
t
i=1
where wij is the weight of term i in document j
• For binary vectors, the inner product is the number of
matched query terms in the document (size of
intersection).
• For weighted term vectors, it is the sum of the
products of the weights of the matched terms.
Properties of Inner Product
• Favors long documents with a large number of unique
terms.
– Again, the issue of normalization
• Measures how many terms matched but not how many
terms are not matched.
Exercise
d1
d2
d3
d4
d5
d6
d7
k1
1
1
0
1
1
1
0
k2
0
0
1
0
1
1
1
k3
1
0
1
0
1
0
0
q
1
2
3
E(di,q) M(di,q)
di.q
Cosine Similarity
• Distance between vectors d1 and d2 captured by the
cosine of the angle x between them.
• Note – this is similarity, not distance
t3
d2
d1
θ
t1
t2
Cosine Similarity
d1 × d2
sim(d1, d2 ) =
=
d1 d2
å ww
å w å
t
i=1
t
i=1
i1
i2
2
i1
• wij – weight of term i in document j
t
i=1
w
2
i2
• Cosine of angle between two vectors
• The denominator involves the lengths of the vectors
• So the cosine measure is also known as the normalized
inner product
Length d j =
å
t
i=1
w
2
ij
Exercise
• Documents: Sense and Sensibility, Pride and Prejudice;
Wuthering Heights
• Vocabulary of four terms
term
affection
SaS
PaP
WH
term
SaS
PaP
WH
115
58
20
affection
0.789
0.832
0.524
jealous
10
7
11
jealous
0.515
0.555
0.465
gossip
2
0
6
gossip
0.335
0
0.405
wuthering
0
0
38
0
0
0.588
wuthering
• Measure cos(SaS,PaP), cos(SaS,WH)
Exercise
• Rank the following by decreasing cosine similarity:
– Two documents that have only frequent words (the, a, an, of) in
common.
– Two documents that have no words in common.
– Two documents that have many rare words in common
(wingspan, tailfin).
Cosine Similarity vs. Inner Product
t3
• Cosine similarity measures the cosine of the
angle between two vectors.
1
• Inner product normalized by the vector lengths.
 
dj q
 
CosSim(dj, q) =
dj  q
 
InnerProduct(dj, q) = dj q


D1
t

 ( wij  wiq )
2
i 1
t
t
 wij   wiq
i 1
2
i 1
Q
2
t2
D2
D1 = 2T1 + 3T2 + 5T3 CosSim(D1 , Q) = 10 / (4+9+25)(0+0+4) = 0.81
D2 = 3T1 + 7T2 + 1T3 CosSim(D2 , Q) = 2 / (9+49+1)(0+0+4) = 0.13
Q = 0T1 + 0T2 + 2T3
D1 is 6 times better than D2 using cosine similarity but only 5 times better using
inner product.
t1
Comments on Vector Space Models
• Simple, mathematically based approach.
• Considers both local (tf) and global (idf) word occurrence
frequencies.
• Provides partial matching and ranked results.
• Tends to work quite well in practice despite obvious
weaknesses.
• Allows efficient implementation for large document
collections.
Problems with Vector Space Model
• Missing semantic information (e.g. word sense).
• Missing syntactic information (e.g. phrase structure,
word order, proximity information).
• Assumption of term independence
• Lacks the control of a Boolean model (e.g., requiring a
term to appear in a document).
– Given a two-term query “A B”, may prefer a document
containing A frequently but not B, over a document that contains
both A and B, but both less frequently.
Download