Information retrieval

DSCI 5240
CHAPTER 2
Information retrieval
INSTRUCTOR: DR. NICK EVANGELOPOULOS
PRESENTED BY: QIUXIA WU

Introduction
• Definition: Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources.

History of Modern IR
• For over 4000 years, humans have been designing tools to improve information storage and retrieval.
• Vannevar Bush's 1945 paper: “As We May Think”
• The first automated information retrieval systems (1950s and 1960s)
• SMART (the System for the Manipulation and Retrieval of Text)
  - conceived at Harvard University and flourished at Cornell University under the leadership of Gerard Salton
  - the first practical implementation of an IR system
• The basic theoretical foundations of SMART still play a major role in today’s IR systems.

Modern Information Retrieval
• Document representation
  - Using keywords
  - Relative weight of keywords
• Query representation
  - Keywords
  - Relative importance of keywords

Retrieval Models
• Retrieval models match a query with documents in order to:
  - separate documents into relevant and non-relevant classes
  - rank the documents according to their relevance

Retrieval Models
• Boolean model
• Vector space model
• Probabilistic models

Boolean Retrieval Model
• One of the simplest and most efficient retrieval mechanisms
• Based on set theory and Boolean algebra
• Conventional numeric representation: false as 0 and true as 1
• The Boolean model is interested only in the presence or absence of a term in a document
• In the term-document matrix, replace all nonzero values with 1

Boolean Model: Advantages
• Simplicity and efficiency of implementation
• Binary values can be stored using bits
  - reduced storage requirements
  - retrieval using bitwise operations is efficient
• Boolean retrieval was adopted by many commercial bibliographic systems
  - Boolean queries are akin to database queries
  - Bibliographic systems: database systems rather than information retrieval systems
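
The Boolean model described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular system: the term-document counts are invented, the term names K0 to K4 and the helper names `to_bitmask` and `docs_in` are hypothetical, and each term's row is packed into a plain integer so that AND, OR and NOT queries become bitwise operations.

```python
# Toy term-document matrix: rows are terms, columns are documents D0..D5.
# All counts are invented for illustration only.
term_doc_counts = {
    "K0": [2, 0, 1, 0, 3, 0],
    "K1": [0, 1, 0, 0, 0, 2],
    "K2": [1, 1, 0, 4, 0, 0],
    "K3": [0, 0, 0, 1, 0, 0],
    "K4": [0, 1, 0, 0, 1, 5],
}
NUM_DOCS = 6


def to_bitmask(counts):
    """Replace every nonzero count with 1 and pack the row into an int."""
    mask = 0
    for doc_id, count in enumerate(counts):
        if count != 0:
            mask |= 1 << doc_id  # bit doc_id is set when the term is present
    return mask


def docs_in(mask):
    """List the document ids whose bit is set in the mask."""
    return [doc_id for doc_id in range(NUM_DOCS) if (mask >> doc_id) & 1]


# Binary incidence "matrix": one integer bitmask per term.
incidence = {term: to_bitmask(row) for term, row in term_doc_counts.items()}
all_docs = (1 << NUM_DOCS) - 1  # mask with one bit per document

both = incidence["K0"] & incidence["K4"]                   # K0 AND K4
either = incidence["K0"] | incidence["K4"]                 # K0 OR K4
k0_not_k4 = incidence["K0"] & ~incidence["K4"] & all_docs  # K0 AND NOT K4

print("K0 AND K4:", docs_in(both))
print("K0 OR K4:", docs_in(either))
print("K0 AND NOT K4:", docs_in(k0_not_k4))
```

With this made-up matrix, K0 AND K4 matches a single document while K0 OR K4 matches five of the six. Every document in a result set is simply "relevant"; the model gives no way to order them.
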

Boolean Model: Disadvantages
• A document is either relevant or nonrelevant to the query
• It is not possible to assign a degree of relevance
• Complicated Boolean queries are difficult for users
• Boolean queries retrieve too few or too many documents
  - K0 AND K4 retrieved only 1 out of 6 documents
  - K0 OR K4 retrieved 5 out of a possible 6 documents

Vector Space Model
• Both the documents and the queries are represented as vectors
• Each term is given a weight based on its frequency in the document
• More sophisticated weighting schemes will be studied later

VSM versus Boolean Model
• Queries are easier to express: users can attach relative weights to terms
• A descriptive query can be transformed into a query vector in the same way as a document
• Matching between a query and a document is not exact: each document is assigned a degree of similarity
• Documents are ranked by their similarity scores instead of being split into relevant/nonrelevant classes
• Users can go through the ranked list until their information needs are met

Probabilistic Retrieval Model
• Sparck Jones (1976): classical probabilistic retrieval model, also known as the binary independence retrieval model
• Formulates IR in a probabilistic framework

Comments on Probabilistic Retrieval
• The probabilistic independence assumption is not realistic
• Two-stage retrieval is more complicated
• The performance gain over VSM is debatable

Evaluation of Retrieval Performance
• Precision vs. recall
• F-measure
• Average precision

Precision and Recall
• Precision: the fraction of retrieved documents that are relevant
• Recall: the fraction of relevant documents that are retrieved

F measure

$$F = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

$$F = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57$$

Average Precision

$$\text{Average Precision} = \frac{\sum_{i=1}^{N} \text{precision}(i) \times \text{relevance}(i)}{R}$$

Here precision(i) is the precision over the top i ranked documents, relevance(i) is 1 if the document at rank i is relevant and 0 otherwise, N is the length of the ranked list, and R is the number of relevant documents.

$$\text{Average Precision} = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4} = \frac{2.84}{4} = 0.71$$
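
The evaluation measures above can be checked with a short Python sketch. The function names, the document ids d1 to d6 and the retrieved/relevant sets are assumptions made for illustration. The only values taken from the slides are the rounded precision and recall (0.67 and 0.5) and the relevance pattern 1, 0, 1, 0, 1, 0, 1 with R = 4 from the average precision example.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for an unranked result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(relevance, total_relevant):
    """Sum precision(i) * relevance(i) over the ranks, divided by R.

    relevance: list of 0/1 flags, one per rank, best rank first.
    total_relevant: R, the number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision over the top `rank` results
    return precision_sum / total_relevant if total_relevant else 0.0


# Hypothetical result set: 3 documents retrieved, 2 of them among 4 relevant ones.
p, r = precision_recall({"d1", "d2", "d3"}, {"d1", "d3", "d5", "d6"})
print(round(p, 2), round(r, 2))

# F measure from the rounded values used on the slide.
print(round(f_measure(0.67, 0.5), 2))

# Ranked list with the relevance pattern from the average precision example.
print(round(average_precision([1, 0, 1, 0, 1, 0, 1], 4), 2))
```

Ranks where relevance(i) is 0 contribute nothing to the sum, which is why the loop only accumulates precision at the ranks holding a relevant document.
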
