Information Retrieval
Unit 1
Seema Chandak

Unit 1 : Objective & Content
Objective
To deal with IR: the representation, storage and organization of, and access to, information items.

Unit 1 : Content
• Basic Concepts of IR, Data Retrieval & Information Retrieval, IR system block diagram.
• Automatic Text Analysis, Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighting, Probabilistic Indexing.
• Automatic Classification, Measures of Association, Different Matching Coefficients, Classification Methods, Cluster Hypothesis.
• Clustering Algorithms, Single Pass Algorithm, Single Link Algorithm, Rocchio's Algorithm, Dendrogram.

What is IR
Information retrieval: the subfield of computer science that deals with the automated retrieval of information items (especially text) based on their content and context.
The term Information Retrieval was coined by Calvin Mooers (1950).
“It is concerned with the representation, storage, organization and accessing of information items.”

Need for IR
• Information is considered the most important resource for most activities.
• Example: timely weather reports.
• Timely sharing of information.
• The timely retrieval of information plays a major role, in keeping with the motto “right information at the right time”.

Types of IR
– Structured (all database management systems)
– Unstructured (search engines)
– Semi-structured (data warehouses)

IR Based on Structured Data
• Recollect terms related to DBMS:
– Data organization in the form of schema, keys, index, metadata…
– Query structure
– Result set
– …

Why IR? Why not Database?
What are some limitations of Database Systems?

IR Vs. DR
Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus.
Example: Get documents about Java, except for the ones that are about Java coffee.
Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Example: Get all documents containing the term “Java” but not containing the term “coffee”.

IR Vs. DR
1. Matching
– In data retrieval we are normally looking for an exact match, that is, we are checking whether an item is or is not present in the file.
– E.g., SELECT * FROM Student WHERE per >= 75.0
– In information retrieval we more generally want to find the items that partially match the request and then select a few of the best-matching ones from those.
– E.g., “students of PICT college with 75 percent or more”.
2. Inference
– In data retrieval the inference is of the simple deductive kind, that is, if aRb and bRc then aRc.
– In information retrieval the inference is inductive.
– Relations are only specified with a degree of certainty or uncertainty, and hence our confidence in the inference is variable.
3. Model
– Data retrieval is deterministic, but information retrieval is probabilistic.
– Bayes' Theorem is frequently invoked to carry out inferences in IR, but in DR probabilities do not enter into the processing.
4. Classification
– In DR a monothetic classification is most likely to be used, that is, one with classes defined by objects possessing attributes that are both necessary and sufficient to belong to a class.
– In IR such a classification is not very useful; a polythetic classification is mostly used.
– Each individual in a class possesses only a proportion of all the attributes possessed by all the members of that class.
– Hence no attribute is either necessary or sufficient for membership of a class.
5. Query Language
– The query language for DR is one with restricted syntax and vocabulary.
– In IR we prefer to use natural language, although there are some notable exceptions.
6. Query Specification
– In DR the query is generally a complete specification of what is wanted.
– In IR it is invariably incomplete.
7. Items wanted
– In IR we are searching for relevant documents, as opposed to exactly matching items in DR.
8. Error response
– DR is more sensitive to error: an error in matching means the wanted item is not retrieved, which implies a total failure of the system.
– In IR small errors in matching generally do not affect the performance of the system significantly.

IR Vs. DR
                      Data Retrieval (DR)                         Information Retrieval (IR)
Matching              Exact match                                 Partial match, best match
Inference             Deduction                                   Induction
Model                 Deterministic                               Probabilistic
Classification        Monothetic                                  Polythetic
Data                  Database tables, structured                 Free text, unstructured
Query language        Artificial (SQL, relational algebra)        Natural (keywords, free text)
Query specification   Complete                                    Incomplete
Items wanted          Matching                                    Relevant
Error response        Sensitive                                   Insensitive
Results               Exact matches                               Approximate matches
Result ordering       Unordered                                   Ordered by relevance
Accessibility         Knowledgeable users or automatic processes  Non-expert humans

Issues with Information Retrieval?
Information Retrieval deals with uncertainty and vagueness in information systems.
• Uncertainty: the available representation typically does not reflect the true semantics/meaning of the objects (text, images, video, etc.).
• Vagueness: the information need of the user lacks clarity and is only vaguely expressed in the query, feedback or user actions.
• Differs conceptually from database queries!

Recall the Definition
• What is IR?
• “Finding some desired information in large data sets or stores of information.”
• Means:
– Searching for documents
– Searching for information in documents
– Searching for metadata which describes documents
– Searching within databases
• Web search engines like Google and Lycos are the most visible IR applications.
• IR systems are used to reduce information overload.

Definition
Automatic Information Retrieval
Automatic – as against ‘manual’.
Information – as against ‘data’.
Defn: An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

Media – Where Does Information Reside?
• Text documents: web pages, books, articles, papers, emails, etc.
• Manuscripts
• Graphics & Images
• Speech & Video
• Maps & Satellite Imagery
• Local Information, Yellow Pages
• Mismatch: the given representation in a specific medium vs. the semantic description of the information (semantic gap)

Scale – How Much Information is out there?
• World Wide Web: tens or hundreds of billions of documents? At approx. 10 KB/doc, 100s of TB.
• Then there is everything else: email, personal files, proprietary databases, broadcast media, print.
• Estimated 5 exabytes p.a. (growing at 30%)
• 800 MB p.a. per person
• The Web is just a tiny starting point…

IR problem
It mainly deals with a very large, mostly unstructured data set.
The IR problem consists of:
• building efficient indexes,
• processing user queries with high performance,
• improving the ‘quality’ of the answer set.

Basic Concepts
• Information retrieval is directly affected by:
– User Tasks
– Logical view of the documents

User Tasks
• Interaction of the user with the retrieval system.
[Figure: the two user tasks on the document collection, Retrieval and Browsing]
• Classical information retrieval systems allow retrieval.
• Hypertext systems are usually tuned for quick browsing.
• Modern digital libraries and Web interfaces might attempt to combine these tasks.

Logical view of the document
• The representation of documents by keywords or index terms is known as the logical view of the documents.
• Keywords are either extracted directly from the text of the document or specified by a human.
• Modern computers can represent a document by:
– its full set of words, or
– a smaller set of words.
• Stopwords: elimination of articles and connectives.
• Stemming: reduces distinct words to their common grammatical roots.
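
The two reductions named above, stopword elimination and stemming, can be sketched as follows. This is only a minimal illustration under stated assumptions: the stopword list and the suffix list are made up for the example, `crude_stem` and `logical_view` are hypothetical helper names, and the suffix stripping is a deliberately crude stand-in, not a real conflation algorithm such as Porter's.

```python
import re

# Illustrative stopword list (an assumption; real systems use longer lists).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for", "by", "or"}

# Illustrative suffixes, tried in this order; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def logical_view(text):
    """Return the sorted set of index terms that represents a document."""
    tokens = re.findall(r"[a-z]+", text.lower())          # full set of words
    kept = [t for t in tokens if t not in STOPWORDS]      # stopword elimination
    return sorted({crude_stem(t) for t in kept})          # stemming


if __name__ == "__main__":
    doc = "The retrieval of documents is ranked by the indexing of terms"
    print(logical_view(doc))
    # prints ['document', 'index', 'rank', 'retrieval', 'term']
```

The eleven words of the sample sentence shrink to five index terms, which is the "smaller set of words" view of the document.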
Introduction…
• Information Retrieval System:
[Figure: A typical IR system, with the components Queries, Documents, Input, Processor, Output and Feedback (from a sample retrieval)]

Introduction…
• Information Retrieval System:
– Input: only a representation of the document (or query) is stored, which means that the text of a document is lost once it has been processed to generate its representation.
– A document representative could be a list of extracted words considered to be significant.
– The user has to express the needed information in the language the system accepts.
– Processor: performs the actual retrieval function, executing the search strategy in response to a query.
– Feedback: improves the subsequent run after a sample retrieval.
– Output: a set of document numbers, on which the evaluation can then be done.

Information Retrieval Process
[Figure: the information retrieval process, with the elements Information need, Query, Parse, Collections, text input, Pre-process, Index and Rank, annotated "How is the query constructed?" and "How is the text processed?"]

Definitions
• Searching: seeking specific information within a body of information. The result of a search is a set of hits.
• Browsing: unstructured exploration of a body of information.
• Linking: moving from one item to another by following links, such as citations, references, etc.

The Basics of Information Retrieval
Query: a string of text describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols.
Full text searching: methods that compare the query with every word in the text, without distinguishing the function of the various words.
Fielded searching: methods that search on specific bibliographic or structural fields, such as author or heading.

SORTING AND RANKING HITS
When a user submits a query to a search system, the system returns a set of hits.
With a large collection of documents, the set of hits may be very large. The value to the user depends on the order in which the hits are presented.
Three main methods:
• Sorting the hits, e.g., by date
• Ranking the hits by similarity between query and document
• Ranking the hits by the importance of the documents

Examples of Search Systems
• Finding a file on a computer system (Spotlight for Macintosh).
• Library catalog for searching bibliographic records about books and other objects (Library of Congress catalog).
• Abstracting and indexing system for finding research information about specific topics (Medline for medical information).
• Web search service for finding web pages (Google).
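
To make the three ways of ordering hits from "Sorting and Ranking Hits" concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the three hits, their `date` and `importance` values, and the choice of the Jaccard coefficient (term-set overlap) as the query-document similarity. Real search systems use richer similarity and importance measures.

```python
from datetime import date

# Hypothetical hits; the "date" and "importance" values are made up.
HITS = [
    {"id": 1, "text": "java programming language tutorial",
     "date": date(2020, 5, 1), "importance": 0.4},
    {"id": 2, "text": "java coffee beans",
     "date": date(2022, 1, 15), "importance": 0.6},
    {"id": 3, "text": "programming language design",
     "date": date(2021, 3, 10), "importance": 0.9},
]


def jaccard(query, text):
    """Term-set overlap between query and document (Jaccard coefficient)."""
    q, d = set(query.lower().split()), set(text.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def sort_by_date(hits):
    """Method 1: sort the hits, newest first."""
    return sorted(hits, key=lambda h: h["date"], reverse=True)


def rank_by_similarity(hits, query):
    """Method 2: rank the hits by similarity between query and document."""
    return sorted(hits, key=lambda h: jaccard(query, h["text"]), reverse=True)


def rank_by_importance(hits):
    """Method 3: rank the hits by a query-independent importance score."""
    return sorted(hits, key=lambda h: h["importance"], reverse=True)


if __name__ == "__main__":
    query = "java programming language"
    print([h["id"] for h in sort_by_date(HITS)])               # prints [2, 3, 1]
    print([h["id"] for h in rank_by_similarity(HITS, query)])  # prints [1, 3, 2]
    print([h["id"] for h in rank_by_importance(HITS)])         # prints [3, 2, 1]
```

The same three hits come back in three different orders, which is why the presentation order matters to the user. Note that the coffee document still appears in the similarity ranking, only last: partial matching ranks items rather than excluding them.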