DSCI 5240, Chapter 2: Information Retrieval
Instructor: Dr. Nick Evangelopoulos
Presented by: Qiuxia Wu

Introduction
- Definition: Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources.

History of Modern IR
- For over 4000 years, humans have been designing tools to improve information storage and retrieval.
- Vannevar Bush's 1945 paper: “As We May Think”
- The first automated information retrieval systems appeared in the 1950s and 1960s.
- SMART (System for the Mechanical Analysis and Retrieval of Text)
  - Conceived at Harvard University; flourished at Cornell University under the leadership of Gerard Salton
  - The first practical implementation of an IR system
  - The basic theoretical foundations of SMART still play a major role in today’s IR systems.

Modern Information Retrieval
- Document representation
  - Keywords
  - Relative weight of keywords
- Query representation
  - Keywords
  - Relative importance of keywords

Retrieval Models
- Retrieval models match a query with documents in order to:
  - separate documents into relevant and nonrelevant classes
  - rank the documents according to their relevance
- Three families of models:
  - Boolean model
  - Vector space model
  - Probabilistic models

Boolean Retrieval Model
- One of the simplest and most efficient retrieval mechanisms
- Based on set theory and Boolean algebra
- Uses the conventional numeric representation of false as 0 and true as 1
- Considers only the presence or absence of a term in a document
- In the term-document matrix, all nonzero values are replaced with 1

Boolean Model: Advantages
- Simplicity and efficiency of implementation
- Binary values can be stored as bits
  - Reduced storage requirements
  - Retrieval using bitwise operations is efficient
- Boolean retrieval was adopted by many commercial bibliographic systems
  - Boolean queries are akin to database queries
  - Bibliographic systems behave more like database systems than information retrieval systems

Boolean Model: Disadvantages
- A document is either relevant or nonrelevant to the query; it is not possible to assign a degree of relevance
- Complicated Boolean queries are difficult for users to formulate
- Boolean queries tend to retrieve too few or too many documents
  - "K0 AND K4" retrieved only 1 of 6 documents
  - "K0 OR K4" retrieved 5 of 6 documents

Vector Space Model (VSM)
- Both documents and queries are represented as vectors
- Each term is given a weight based on its frequency in the document
- More sophisticated weighting schemes will be studied later

VSM versus Boolean Model
- Queries are easier to express: users can attach relative weights to terms
- A descriptive query can be transformed into a query vector, in the same way as a document
- Matching between a query and a document is not exact: each document is assigned a degree of similarity
- Documents are ranked by their similarity scores instead of being split into relevant and nonrelevant classes
- Users can go through the ranked list until their information needs are met

Probabilistic Retrieval Model
- Spärck Jones (1976): the classical probabilistic retrieval model, also known as the binary independence retrieval model
- Formulates IR in a probabilistic framework

Comments on Probabilistic Retrieval
- The probabilistic independence assumption is not realistic
- Two-stage retrieval is more complicated
- The performance gain over the VSM is debatable

Evaluation of Retrieval Performance
- Precision vs. recall
- F-measure
- Average precision

Precision and Recall
- Precision: the fraction of retrieved documents that are relevant
- Recall: the fraction of relevant documents that are retrieved

F-measure
- The F-measure is the harmonic mean of precision and recall:

$$F = \frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

- Example with precision = 0.67 and recall = 0.5:

$$F = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57$$

Average Precision
- N is the number of retrieved documents, precision(i) is the precision at rank i, relevance(i) is 1 if the document at rank i is relevant and 0 otherwise, and R is the number of relevant documents:

$$\text{Average Precision} = \frac{\sum_{i=1}^{N} \text{precision}(i) \times \text{relevance}(i)}{R}$$

- Example with N = 7 and R = 4:

$$\text{Average Precision} = \frac{1.00 \times 1 + 0.50 \times 0 + 0.67 \times 1 + 0.50 \times 0 + 0.60 \times 1 + 0.50 \times 0 + 0.57 \times 1}{4} = \frac{2.84}{4} = 0.71$$
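
The precision, recall, F-measure, and average precision calculations can be checked with a short Python sketch. This is only an illustration, not code from the course: the function names and the document IDs (`d1`, `d3`, ...) are hypothetical, and the relevance pattern 1, 0, 1, 0, 1, 0, 1 is the one implied by the average precision example on the slides.

```python
from typing import Sequence, Set, Tuple


def precision_recall(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for a retrieved list against the set of relevant documents."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(relevance: Sequence[int], total_relevant: int) -> float:
    """Sum precision(i) at every rank i holding a relevant document, divided by R.

    relevance[i - 1] is 1 if the document at rank i is relevant, else 0.
    total_relevant is R, the number of relevant documents.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / total_relevant if total_relevant else 0.0


if __name__ == "__main__":
    # Hypothetical document IDs: 3 retrieved, 2 of them relevant, 4 relevant overall.
    p, r = precision_recall(["d1", "d2", "d3"], {"d1", "d3", "d5", "d6"})
    print(round(p, 2), round(r, 2))            # 0.67 0.5
    print(round(f_measure(p, r), 2))           # 0.57
    # Relevance pattern of the seven ranked documents in the slide example, R = 4.
    print(round(average_precision([1, 0, 1, 0, 1, 0, 1], 4), 2))  # 0.71
```

The sketch computes precision at each rank from running counts rather than from rounded values, so intermediate figures can differ from the slide in the third decimal place while the rounded results agree.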