CSE 535 Information Retrieval
Chapter 1: Introduction to IR
Srihari-CSE535-Spring2008

Motivation
IR: representation, storage, organization of, and access to unstructured data
Focus is on the user information need
User information need:
  When did the Buffalo Bills last win the Super Bowl?
  Find all docs containing information on cricket players who are: (i) temperamental, (ii) popular in their countries, and (iii) play in international test series.
Emphasis is on the retrieval of information (not data)

Motivation
Data retrieval
  which docs contain a set of keywords?
  well-defined semantics
  a single erroneous object implies failure!
Information retrieval
  information about a subject or topic
  deals with unstructured text
  semantics is frequently loose
  small errors are tolerated
IR system:
  interpret contents of information items
  generate a ranking which reflects relevance
  notion of relevance is most important

Basic Concepts
The User Task
[Figure: the user works with the database through retrieval or browsing]
Retrieval
  information or data
  purposeful
  needle in a haystack problem
Browsing
  glancing around
  Formula 1 racing; cars, Le Mans, France, tourism
Filtering (push rather than pull)

Query
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the phrase Romans and countrymen) not feasible

Term-document incidence
           Antony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth
Antony     1                     1              0            0       0        1
Brutus     1                     1              0            1       0        0
Caesar     1                     1              0            1       1        1
Calpurnia  0                     1              0            0       0        0
Cleopatra  1                     0              0            0       0        0
mercy      1                     0              1            1       1        1
worser     1                     0              1            1       1        0
1 if play contains word, 0 otherwise

Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
110100 AND 110111 AND 101111 = 100100.

Answers to query
Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Bigger document collections
Consider N = 1 million documents, each with about 1K terms.
Avg 6 bytes/term incl. spaces/punctuation
  6GB of data in the documents.
Say there are M = 500K distinct terms among these.

Can’t build the matrix
500K x 1M matrix has half-a-trillion 0’s and 1’s.
But it has no more than one billion 1’s. Why?
  Each of the 1M documents has about 1K terms, so at most 1M x 1K = one billion entries can be 1.
  matrix is extremely sparse: >99% zeros
What’s a better representation?
  We only record the 1 positions.
Inverted Index

Ad-Hoc Retrieval
Most standard IR task
System to provide documents from the collection that are relevant to an arbitrary user information need
Information need: topic that user wants to know about
Query: user’s abstraction of the information need
Relevance: document is relevant if the user perceives it as valuable with respect to his information need

Issues to be Addressed by IR
How to improve quality of retrieval
  Precision: what fraction of the returned results are relevant to the information need?
  Recall: what fraction of the relevant documents in the collection are returned by the system?
Understanding user information need
Faster indexes and smaller query response times
Better understanding of user behaviour
  interactive retrieval
  visualization techniques

Inverted index
For each term T: store a list of all documents that contain T.
Do we use an array or a list for this?
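
The idea of storing, for each term, the list of documents that contain it can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the course's implementation: the function names `build_index` and `add_posting` and the three-document toy collection are made up for the example. Postings are kept in sorted Python lists (arrays), so recording a term in a new document means inserting into the middle of a list.

```python
from bisect import bisect_left

def build_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    index = {}
    for doc_id in sorted(docs):                     # ascending docID order
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def add_posting(index, term, doc_id):
    """Record that `term` occurs in `doc_id`, keeping the postings sorted."""
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)             # binary search for the slot
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)                # shifts later entries right

# Hypothetical toy collection: docID -> text (punctuation already removed).
docs = {
    1: "Calpurnia speaks to Caesar",
    2: "Brutus and Calpurnia",
    4: "Brutus alone",
}
index = build_index(docs)
add_posting(index, "caesar", 3)

print(index["brutus"])     # [2, 4]
print(index["calpurnia"])  # [1, 2]
print(index["caesar"])     # [1, 3]
```

With an array-backed list, the insertion in `add_posting` has to shift every later entry, whereas a linked list would only relink pointers at the cost of extra space per posting.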
Brutus     2 4 8 16 32 64 128
Calpurnia  1 2 3 5 8 13 21 34
Caesar     13 16
What happens if the word Caesar is added to document 14?

Inverted index
Linked lists generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents easy
  Space overhead of pointers
Dictionary  Postings
Brutus      2 4 8 16 32 64 128
Calpurnia   1 2 3 5 8 13 21 34
Caesar      13 16
Sorted by docID (more later on why).

Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  Tokenizer
Token stream: Friends Romans Countrymen
  Linguistic modules (more on these later)
Modified tokens: friend roman countryman
  Indexer
Inverted index:
  friend      2 4
  roman       1 2
  countryman  13 16

Basic Concepts
Logical view of the documents
  documents represented by a set of index terms or keywords
[Figure: text operations applied to docs: structure recognition; accents, spacing; stopwords; noun groups; stemming; automatic or manual indexing. The resulting views range from structure and full text to index terms.]
Document representation viewed as a continuum: logical view of docs might shift

The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index (inverted file), Document Collection. Flow labels: user need, text, logical view, user feedback, query, retrieved docs, ranked docs.]

Applications of IR
Specialized Domains
  biomedical, legal, patents, intelligence
Summarization
Cross-lingual Retrieval, Information Access
Question-Answering Systems
  Ask Jeeves
Web/Text Mining
  data mining on unstructured text
Multimedia IR
  images, document images, speech, music
Web applications
  shopbots
  personal assistant agents

IR Techniques
Machine learning
  clustering, SVM, latent semantic indexing, etc.
  improving relevance feedback, query processing etc.
Natural Language Processing, Computational Linguistics
  better indexing, query processing
  incorporating domain knowledge: e.g., synonym dictionaries
  use of NLP in IR: benefits yet to be shown for large-scale IR
Information Extraction
  Highly focused natural language processing (NLP)
  named entity tagging, relationship/event detection
Text indexing and compression
User interfaces and visualization
AI
  advanced QA systems, inference, etc.
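
The Boolean query from the term-document incidence slides (Brutus AND Caesar AND NOT Calpurnia) can be sketched in Python by treating each term's 0/1 vector as an integer bit mask. The three vectors are copied from the incidence table; the names `PLAYS`, `INCIDENCE`, `vector`, and `plays_matching` are illustrative choices, not part of the course material.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# One 0/1 incidence vector per term, written as a bit string over PLAYS.
INCIDENCE = {
    "Brutus":    "110100",
    "Caesar":    "110111",
    "Calpurnia": "010000",
}

def vector(term):
    """Return the incidence vector of `term` as an integer bit mask."""
    return int(INCIDENCE[term], 2)

def plays_matching(bits):
    """Return the plays whose position in the bit vector is 1."""
    text = format(bits, f"0{len(PLAYS)}b")
    return [play for play, bit in zip(PLAYS, text) if bit == "1"]

# Python integers are unbounded, so NOT is masked to the six plays.
ALL_ONES = (1 << len(PLAYS)) - 1
not_calpurnia = ALL_ONES & ~vector("Calpurnia")      # 101111

result = vector("Brutus") & vector("Caesar") & not_calpurnia

print(format(result, "06b"))    # 100100
print(plays_matching(result))   # ['Antony and Cleopatra', 'Hamlet']
```

The result matches the slide's answer, but this approach still needs one bit per term per document, which is exactly what becomes infeasible for the 500K x 1M matrix and motivates the inverted index.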