Information retrieval

DSCI 5240
CHAPTER 2
Information retrieval
INSTRUCTOR: DR. NICK EVANGELOPOULOS
PRESENTED BY: QIUXIA WU
Introduction
 Definition: Information retrieval is the activity of
obtaining information resources relevant to an
information need from a collection of information
resources.
History of Modern IR
For over 4,000 years, humans have been designing tools to improve information
storage and retrieval.
 Vannevar Bush's 1945 paper: “As We May Think”
 The first automated information retrieval systems appeared in the 1950s and 1960s
 SMART (the System for the Manipulation and Retrieval of Text)
    Conceived at Harvard University; flourished at Cornell University under the leadership of Gerard Salton
    The first practical implementation of an IR system
 The basic theoretical foundations of SMART still play a major role in today’s IR systems.
Modern Information Retrieval
 Document representation
 Using keywords
 Relative weight of keywords
 Query representation
 Keywords
 Relative importance of keywords
Retrieval Models
 Retrieval models match query with documents to:
 separate documents into relevant and non-relevant classes
 rank the documents according to their relevance
Retrieval Models
 Boolean model
 Vector space model
 Probabilistic models
Boolean Retrieval Model
 One of the simplest and most efficient retrieval mechanisms
 Based on set theory and Boolean algebra
 Uses the conventional numeric representation of false as 0 and true as 1
 The Boolean model is interested only in the presence or absence of a term in a document
 In the term-document matrix, replace all nonzero values with 1
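The binarization step and the AND/OR matching described above can be sketched as follows. This is a minimal illustration, not the course's implementation: the term counts, the term names, and the helper functions `binarize`, `boolean_and`, `boolean_or`, and `matching_docs` are all hypothetical.

```python
# Hypothetical term-document counts: one row per term, one column per document D0..D3.
counts = {
    "retrieval": [2, 0, 1, 0],
    "boolean":   [0, 3, 1, 0],
    "vector":    [1, 0, 0, 4],
}

def binarize(matrix):
    """Replace every nonzero count with 1, keeping only presence or absence."""
    return {term: [1 if c != 0 else 0 for c in row] for term, row in matrix.items()}

def boolean_and(a, b):
    """Element-wise AND of two binary term rows."""
    return [x & y for x, y in zip(a, b)]

def boolean_or(a, b):
    """Element-wise OR of two binary term rows."""
    return [x | y for x, y in zip(a, b)]

def matching_docs(row):
    """Indices of the documents whose bit is set."""
    return [i for i, bit in enumerate(row) if bit]

incidence = binarize(counts)
and_hits = matching_docs(boolean_and(incidence["retrieval"], incidence["boolean"]))
or_hits = matching_docs(boolean_or(incidence["retrieval"], incidence["boolean"]))
print(and_hits, or_hits)
```

With this toy data, the AND query matches only the document that contains both terms, while the OR query matches every document that contains either one.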
Boolean Model: Advantages
 Simplicity and efficiency of implementation
 Binary values can be stored using bits
 reduced storage requirements
 retrieval using bitwise operations is efficient
 Boolean retrieval was adopted by many commercial bibliographic systems
 Boolean queries are akin to database queries
 Bibliographic systems therefore behave more like database systems than like information retrieval systems
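The storage and bitwise-retrieval points above can be illustrated with Python integers used as bitsets. This is only a sketch under assumptions: the document ids, the names `term_a` and `term_b`, and the helpers `to_bitset` and `from_bitset` are hypothetical and not taken from the slides.

```python
def to_bitset(doc_ids):
    """Pack document ids into one integer; bit i is set when document i contains the term."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits

def from_bitset(bits):
    """Unpack an integer bitset into a sorted list of document ids."""
    doc_ids = []
    position = 0
    while bits:
        if bits & 1:
            doc_ids.append(position)
        bits >>= 1
        position += 1
    return doc_ids

# Hypothetical postings for two terms over documents 0..5.
term_a = to_bitset([0, 2, 5])
term_b = to_bitset([2, 3])

both = from_bitset(term_a & term_b)    # documents containing both terms
either = from_bitset(term_a | term_b)  # documents containing at least one term
print(both, either)
```

Each term needs one bit per document, and a whole query is answered with a single `&` or `|` on the packed integers.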
Boolean Model: Disadvantages
 A document is either relevant or nonrelevant to the
query
 It is not possible to assign a degree of relevance
 Complicated Boolean queries are difficult for users
 Boolean queries tend to retrieve too few or too many documents:
    K0 AND K4 retrieved only 1 of 6 documents
    K0 OR K4 retrieved 5 of 6 documents

Vector Space Model
 Both documents and queries are represented as vectors
 Each term receives a weight based on its frequency in the document
 More sophisticated weighting schemes will be studied later
VSM versus Boolean Model
 Queries are easier to express: users can attach relative weights to terms
 A descriptive query can be transformed into a query vector, just like a document
 Matching between a query and a document is not exact: each document is assigned a degree of similarity
 Documents are ranked by their similarity scores instead of being split into relevant/nonrelevant classes
 Users can go through the ranked list until their information needs are met
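A small sketch of ranking by similarity may make the contrast with the Boolean model concrete. The slides do not name a similarity measure, so cosine similarity over raw term frequencies is an assumption here, and the documents, the query, and the helpers `tf_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Represent a text as a sparse vector of raw term frequencies."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mini-collection and query.
docs = {
    "d1": "boolean retrieval uses boolean algebra",
    "d2": "vector space retrieval ranks documents",
    "d3": "probabilistic models estimate relevance",
}
query = tf_vector("vector retrieval")

# Rank documents from most to least similar to the query.
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranked)
```

Every document receives a graded score rather than a yes/no decision, so a partially matching document such as `d1` still appears in the list, below the better match `d2`.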
Probabilistic Retrieval Model
 Sparck-Jones (1976): classical probabilistic retrieval
model, also known as the binary independence
retrieval model
 Formulates IR in probabilistic framework
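As a rough illustration of scoring under the binary independence assumption, the sketch below ranks documents when no relevance information is available yet. It rests on several assumptions not stated in the slides: the smoothed term weight log((N - df + 0.5) / (df + 0.5)) is one common simplification, and the collection, the query, and the functions `term_weight` and `score` are hypothetical.

```python
import math

# Hypothetical collection: each document is reduced to the set of terms it contains,
# because the binary independence model only looks at presence or absence.
docs = {
    "d1": {"boolean", "retrieval", "algebra"},
    "d2": {"vector", "retrieval", "ranking"},
    "d3": {"probabilistic", "relevance"},
    "d4": {"vector", "space"},
    "d5": {"language", "models"},
}

def term_weight(term, collection):
    """Smoothed weight of a term when no relevance judgments exist (assumed form)."""
    n = len(collection)
    df = sum(1 for terms in collection.values() if term in terms)
    return math.log((n - df + 0.5) / (df + 0.5))

def score(query_terms, doc_terms, collection):
    """Sum the weights of the query terms present in the document (terms treated as independent)."""
    return sum(term_weight(t, collection) for t in query_terms if t in doc_terms)

query = {"retrieval", "ranking"}
scores = {d: score(query, terms, docs) for d, terms in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:2])
```

Rarer query terms contribute more to the score, so `d2`, which contains both query terms, ranks above `d1`, which contains only the more common one.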
Comments on Probabilistic Retrieval
 Probabilistic independence model is not realistic
 Two-stage retrieval is more complicated
 Performance gain over VSM is debatable
Evaluation of Retrieval Performance
 Precision vs. Recall
 F-measure
 Average precision
Precision and Recall
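Using the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved), the two measures can be computed from sets of document ids. The ids below are hypothetical, chosen so that precision is about 0.67 and recall is 0.5, and `precision_recall` is an illustrative helper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved ids and a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 3 documents retrieved, 2 of them relevant, 4 relevant documents overall.
retrieved = {"d1", "d2", "d3"}
relevant = {"d1", "d3", "d7", "d9"}
precision, recall = precision_recall(retrieved, relevant)
print(round(precision, 2), recall)
```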
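F measure
\[
F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
  = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall}) / 2}
\]
Example:
\[
F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
  = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5}
  = \frac{0.67}{1.17}
  = 0.57
\]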
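The F-measure arithmetic can be checked with a few lines of code. The function name `f_measure` is illustrative, and returning 0.0 when both inputs are zero is an assumption made to avoid division by zero.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # assumed convention when nothing relevant is retrieved
    return 2 * precision * recall / (precision + recall)

# Values from the slide example: precision = 0.67, recall = 0.5.
f_value = f_measure(0.67, 0.5)
print(round(f_value, 2))
```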
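Average Precision
\[
\text{Average Precision} = \frac{\sum_{i=1}^{N} \text{precision}(i) \times \text{relevance}(i)}{R}
\]
Here N is the number of retrieved documents, precision(i) is the precision over the top i results, relevance(i) is 1 if the document at rank i is relevant and 0 otherwise, and R is the number of relevant documents.
\[
\text{Average Precision}
  = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4}
  = \frac{2.84}{4}
  = 0.71
\]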
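The average precision example can be reproduced in code. The relevance pattern below (relevant documents at ranks 1, 3, 5, and 7) is inferred from the precision values in the example, and `average_precision` is an illustrative helper rather than a library function.

```python
def average_precision(relevance, total_relevant):
    """Sum precision at each rank that holds a relevant document, divided by R."""
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision over the top `rank` results
    return total / total_relevant if total_relevant else 0.0

# Ranked list of 7 results; 1 marks a relevant document. R = 4 relevant documents.
relevance = [1, 0, 1, 0, 1, 0, 1]
ap = average_precision(relevance, total_relevant=4)
print(round(ap, 2))
```

Only the ranks holding relevant documents contribute to the sum, which is why the terms multiplied by 0 in the worked example drop out.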