Morteza Zihayat

Information Retrieval
and
Vector Space Model

Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat
Outline
Introduction to IR
IR System Architecture
Vector Space Model (VSM)
How to Assign Weights?
TF-IDF Weighting
Example
Advantages and Disadvantages of VS Model
Improving the VS Model
Information Retrieval and Vector Space Model
Introduction to IR

"The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth."
(Lyman & Varian, 2000)
Growth of textual information

Textual information is growing everywhere: literature, the WWW, news, email, desktop documents, blogs, and intranets.
How can we help manage and exploit all this information?

Information overflow
What is Information Retrieval (IR)?

Narrow sense:
- IR = search engine technologies (e.g., Google, library information systems)
- IR = text matching/classification
Broad sense: IR = text information management
- General problem: how to manage text information?
- How to find useful information? (retrieval)
  - Example: Google
- How to organize information? (text classification)
  - Example: automatically assign emails to different folders
- How to discover knowledge from text? (text mining)
  - Example: discover correlations between events
Formalizing IR Tasks

Vocabulary: V = {w1, w2, …, wT} of a language
Query: q = q1, q2, …, qm, where qi ∈ V
Document: di = di1, di2, …, dimi, where dij ∈ V
Collection: C = {d1, d2, …, dN}
Relevant document set: R(q) ⊆ C: generally unknown and user-dependent
The query provides a "hint" about which documents should be in R(q)
IR: find an approximation R'(q) of the relevant document set

Source: This slide is borrowed from [1]
Evaluation measures

The quality of many retrieval systems depends on how well they rank relevant documents.
How can we evaluate rankings in IR?
- IR researchers have developed evaluation measures specifically designed to evaluate rankings.
- Most of these measures combine precision and recall in a way that takes account of the ranking.
Precision & Recall

Source: This slide is borrowed from [1]

In other words:
Precision is the percentage of relevant items in the returned set.
Recall is the percentage of all relevant documents in the collection that are in the returned set.
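As a concrete sketch (in Python, with made-up document IDs), set-based precision and recall are:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 3 of the 4 returned docs are relevant,
# and the collection contains 6 relevant docs in total.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d5", "d6", "d7"})
# p = 3/4 = 0.75, r = 3/6 = 0.5
```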
Evaluating Retrieval Performance

Source: This slide is borrowed from [1]
IR System Architecture

[Diagram] Documents (docs) pass through an INDEXING module to produce document representations (Doc Rep). The user's query enters through the INTERFACE and is turned into a query representation (Query Rep). The SEARCHING module matches the two representations to produce a ranking of results, and the user's judgments on those results drive QUERY MODIFICATION (feedback) to refine the query.
Indexing Documents

Pipeline: break documents into words → apply the stop list → apply stemming → construct the index.
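A minimal sketch of this pipeline in Python (the tiny stop list and the crude suffix stripping are illustrative stand-ins for a real stop list and stemmer):

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "of", "in", "the", "and", "to"}  # illustrative stop list

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """docs: {doc_id: text} -> inverted index {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):  # break into words
            if word in STOP_WORDS:                        # stop list
                continue
            index[stem(word)].add(doc_id)                 # stemming + index
    return index

index = build_index({1: "shipment of gold", 2: "delivery of silver trucks"})
# "of" is dropped; "trucks" is stemmed to "truck"
```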
Searching

Given a query, score documents efficiently.
The basic question: given a query, how do we know whether document A is more relevant than document B?
- Document A uses more query words than document B
- Word usage in document A is more similar to that in the query
- …
We need a way to compute the relevance between the query and the documents.
The Notion of Relevance

Relevance
- Similarity: sim(Rep(q), Rep(d)) (different representations & similarity measures)
  - Vector space model (Salton et al., 75): today's lecture
  - Probability distribution model (Wong & Yao, 89)
  - Regression model (Fox, 83)
- Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  - Generative models, P(d|q) or P(q|d)
    - Document generation: classical probabilistic model (Robertson & Sparck Jones, 76)
    - Query generation: language modeling approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Probabilistic inference (different inference systems)
    - Probabilistic concept space model (Wong & Yao, 95)
    - Inference network model (Turtle & Croft, 91)
Relevance = Similarity

Assumptions:
- Query and document are represented similarly
- A query can be regarded as a "document"
- Relevance(d,q) ≈ similarity(d,q)
R(q) = {d ∈ C | f(d,q) > θ}, where f(q,d) = sim(Rep(q), Rep(d)) and θ is a score threshold
Key issues:
- How to represent a query/document? (Vector Space Model (VSM))
- How to define the similarity measure?
Vector Space Model (VSM)

The vector space model is one of the most widely used models for ad hoc retrieval.
It is used in information filtering, information retrieval, indexing, and relevance ranking.
VSM

Represent a document/query by a term vector:
- Term: a basic concept, e.g., a word or phrase
- Each term defines one dimension
- N terms define a high-dimensional space
- E.g., d = (x1, …, xN), where xi is the "importance" of term i
Measure relevance by the distance between the query vector and the document vector in this space.
VS Model: Illustration

[Diagram: documents D1–D11 and a query plotted as vectors in a three-dimensional term space with axes "Starbucks", "Microsoft", and "Java"; documents near the query vector are the candidates for retrieval.]
Some Issues with the VS Model

- There is no consistent definition of the basic concept ("term")
- How to assign weights to words is not fully determined
  - The weight in the query indicates the importance of a term
How to Assign Weights?

Different terms have different importance in a text.
A term weighting scheme plays an important role in the similarity measure.
- Higher weight = greater impact
We now turn to the question of how to weight words in the vector space model.
There are three components in a weighting scheme:
- gi: the global weight of the ith term
- tij: the local weight of the ith term in the jth document
- dj: the normalization factor for the jth document
TF Weighting

Idea: a term is more important if it occurs more frequently in a document.
Formulas: let f(t,d) be the frequency count of term t in document d:
- Raw TF: TF(t,d) = f(t,d)
- Log TF: TF(t,d) = log f(t,d)
- Maximum frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
Normalization of TF is very important!
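These TF variants translate directly to code (a small sketch; the log is taken only for positive counts to avoid log 0):

```python
import math

def raw_tf(f):
    # Raw TF: the frequency count itself
    return f

def log_tf(f):
    # Log TF: log f(t,d), taken as 0 when the term is absent
    return math.log(f) if f > 0 else 0.0

def maxfreq_tf(f, max_freq):
    # Maximum frequency normalization: 0.5 + 0.5 * f(t,d) / MaxFreq(d)
    return 0.5 + 0.5 * f / max_freq

# In a doc whose most frequent term occurs 4 times:
# raw_tf(2) = 2, maxfreq_tf(2, 4) = 0.75
```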
TF Methods
IDF Weighting

Idea: a term is more discriminative if it occurs in fewer documents.
Formula: IDF(t) = 1 + log(n/k)
- n: total number of documents
- k: number of documents containing term t (document frequency)
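In code, with n and k as defined above:

```python
import math

def idf(n, k):
    # IDF(t) = 1 + log(n / k); n = total docs, k = docs containing t
    return 1.0 + math.log(n / k)

# A term occurring in every document gets the minimum weight:
# idf(1000, 1000) = 1.0, while a rarer term, e.g. idf(1000, 10), weighs more.
```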
IDF Weighting Methods
TF Normalization

Why?
- Document length varies
- "Repeated occurrences" are less informative than the "first occurrence"
Two views of document length:
- A document is long because it uses more words
- A document is long because it has more content
Generally, penalize long documents, but avoid over-penalizing them.
TF-IDF Weighting

TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t)
- Common in the document → high TF → high weight
- Rare in the collection → high IDF → high weight
Imagine a word-count profile: what kind of terms would have high weights?
How to Measure Similarity?

Di = (wi1, …, wiN)
Q = (wq1, …, wqN)
where w = 0 if a term is absent.

Dot product similarity:
SC(Q, Di) = Σj=1..N wqj · wij

Cosine similarity (normalized dot product):
sim(Q, Di) = ( Σj=1..N wqj · wij ) / ( sqrt(Σj=1..N wqj²) · sqrt(Σj=1..N wij²) )
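Both measures are a few lines of Python; this sketch assumes the query and document weight vectors are aligned over the same N terms:

```python
import math

def dot_product(q, d):
    # SC(Q, D) = sum_j w_qj * w_dj
    return sum(wq * wd for wq, wd in zip(q, d))

def cosine(q, d):
    # Normalized dot product; defined as 0 if either vector is all zeros.
    norm = math.sqrt(sum(w * w for w in q)) * math.sqrt(sum(w * w for w in d))
    return dot_product(q, d) / norm if norm else 0.0

q = [1.0, 1.0, 0.0]
d = [2.0, 2.0, 0.0]
# dot_product(q, d) = 4.0; cosine(q, d) = 1.0 (same direction)
```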
VS Example: Raw TF & Dot Product

Query = "information retrieval"
doc1 = {information ×2, retrieval, search, engine}
doc2 = {travel ×2, information, map}
doc3 = {government, president, congress}

Term         IDF (fake)   doc1 tf (tf·idf)   doc2 tf (tf·idf)   doc3 tf (tf·idf)   query tf (tf·idf)
information  2.4          2 (4.8)            1 (2.4)            -                  1 (2.4)
retrieval    4.5          1 (4.5)            -                  -                  1 (4.5)
travel       2.8          -                  2 (5.6)            -                  -
map          3.3          -                  1 (3.3)            -                  -
search       2.1          1 (2.1)            -                  -                  -
engine       5.4          1 (5.4)            -                  -                  -
govern.      2.2          -                  -                  1 (2.2)            -
president    3.2          -                  -                  1 (3.2)            -
congress     4.3          -                  -                  1 (4.3)            -

Sim(q,doc1) = 4.8·2.4 + 4.5·4.5 = 31.77
Sim(q,doc2) = 2.4·2.4 = 5.76
Sim(q,doc3) = 0
Example

Q: "gold silver truck"
- D1: "Shipment of gold damaged in a fire"
- D2: "Delivery of silver arrived in a silver truck"
- D3: "Shipment of gold arrived in a truck"
- Document frequency of the jth term: dfj
- Inverse document frequency: idfj = log10(n / dfj)
TF·IDF is used as the term weight here.
Example (Cont'd)

Id   Term       df   idf
1    a          3    0
2    arrived    2    0.176
3    damaged    1    0.477
4    delivery   1    0.477
5    fire       1    0.477
6    gold       2    0.176
7    in         3    0
8    of         3    0
9    silver     1    0.477
10   shipment   2    0.176
11   truck      2    0.176
Example (Cont'd)

TF·IDF weights (terms t1–t11 in the order above):

doc  t1  t2     t3     t4     t5     t6     t7  t8  t9     t10    t11
D1   0   0      0.477  0      0.477  0.176  0   0   0      0.176  0
D2   0   0.176  0      0.477  0      0      0   0   0.954  0      0.176
D3   0   0.176  0      0      0      0.176  0   0   0      0.176  0.176
Q    0   0      0      0      0      0.176  0   0   0.477  0      0.176

SC(Q, D1) = (0.176)(0.176) = 0.031
SC(Q, D2) = (0.477)(0.954) + (0.176)(0.176) = 0.486
SC(Q, D3) = (0.176)(0.176) + (0.176)(0.176) = 0.062
The ranking would be D2, D3, D1.
- This SC uses the dot product.
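The whole example can be reproduced in a few lines of Python (using the vocabulary of the term table, so D1 contains "damaged"); after rounding, the scores match 0.031, 0.486, and 0.062:

```python
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

# idf_j = log10(n / df_j), computed from the collection
n = len(docs)
df = Counter(term for text in docs.values() for term in set(text.split()))
idf = {t: math.log10(n / df[t]) for t in df}

def weights(text):
    # tf * idf for each term occurring in the text
    tf = Counter(text.split())
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

q = weights(query)
scores = {
    d: sum(q[t] * w.get(t, 0.0) for t in q)
    for d, w in ((d, weights(text)) for d, text in docs.items())
}
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking == ["D2", "D3", "D1"]
```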
Advantages of the VS Model

- Empirically effective! (top TREC performance)
- Intuitive
- Easy to implement
- Well studied / most evaluated
The SMART system:
- Developed at Cornell: 1960–1999
- Still widely used
Warning: there are many variants of TF-IDF!
Disadvantages of the VS Model

- Assumes term independence
- Assumes query and document representations are the same
- Lots of parameter tuning!
Improving the VS Model

We can improve the model by:
- Reducing the number of dimensions:
  - eliminating all stop words and very common terms
  - stemming terms to their roots
  - Latent Semantic Analysis
- Not retrieving documents below a defined cosine threshold
- Normalizing the frequency of a term i in document j:
  - normalized document frequencies
  - normalized query frequencies
[1]
Stop List

- Function words do not bear useful information for IR: of, not, to, or, in, about, with, I, be, …
- A stop list contains stop words, which are not to be used as index terms:
  - prepositions
  - articles
  - pronouns
  - some adverbs and adjectives
  - some frequent words (e.g., "document")
- The removal of stop words usually improves IR effectiveness.
- A few "standard" stop lists are commonly used.
Stemming

Reason:
- Different word forms may bear similar meanings (e.g., search, searching): create a "standard" representation for them
Stemming:
- Removing some endings of words, e.g.:
  dancer, dancers, dance, danced, dancing → dance
Stemming (Cont'd)

Two main methods:
- Linguistic/dictionary-based stemming:
  - high stemming accuracy
  - high implementation and processing costs, higher coverage
- Porter-style stemming:
  - lower stemming accuracy
  - lower implementation and processing costs, lower coverage
  - usually sufficient for IR
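A Porter-style stemmer is rule-based suffix stripping; the real Porter algorithm has several ordered rule phases, but a toy version conveys the idea (the rules below are a small illustrative subset, not the actual Porter rules):

```python
def toy_stem(word):
    # Tiny rule-based stemmer: an illustrative subset, NOT the real Porter rules.
    rules = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Stems need not be dictionary words:
# toy_stem("dancing") -> "danc", toy_stem("ponies") -> "poni"
```

Note that stems only need to conflate related forms consistently, not produce real words, which is why this is usually sufficient for IR.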
Latent Semantic Indexing (LSI) [3]

- Reduces the dimensionality of the term-document space
- Attempts to address synonymy and polysemy
- Uses Singular Value Decomposition (SVD):
  - identifies patterns in the relationships between the terms and concepts contained in an unstructured collection of text
- Based on the principle that words used in the same contexts tend to have similar meanings.
LSI Process

In general, the process involves:
- constructing a weighted term-document matrix
- performing a Singular Value Decomposition (SVD) on the matrix
- using the resulting matrices to identify the concepts contained in the text
LSI statistically analyzes the patterns of word usage across the entire document collection.
References

- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schuetze
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_modelupdated.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
- Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299
Thanks For Your Attention ….