IT-522: Web Databases And Information Retrieval
Lecture 2
By Dr. Syed Noman Hasany

IR System Architecture
[Diagram: the user's need, expressed as text through the User Interface, passes through Text Operations to a logical view and a query; Query Operations (informed by user feedback) refine the query; Searching runs the query against the index, an inverted file maintained by the Indexing module and the DB Manager over the Text Database; the retrieved docs are scored by Ranking and returned to the user as ranked docs.]

IR System Components
• Searching retrieves documents that contain a given query token from the inverted index.
• Ranking scores all retrieved documents according to a relevance metric.
• User Interface manages interaction with the user:
  – Query input and document output.
  – Relevance feedback.
  – Visualization of results.
• Query Operations transform the query to improve retrieval:
  – Query expansion using a thesaurus.
  – Query transformation using relevance feedback.

Assignment 1a
• Describe the IR system architecture in detail, with the sequence of operations performed in relation to information retrieval.

Document Preprocessing
Basic text processing techniques applied before an index is created:
• Tokenize the document.
• Remove stop words (also known as noise words).
• Perform stemming (saves indexing space and improves recall).
• Create the index: data about the remaining keywords is then stored in a postings data structure.
(A short code sketch of these steps is given after the Porter Stemmer notes below.)

Tokenization
• Analyze text into a sequence of discrete tokens (words).
• Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
  – However, frequently they are not.
• The simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens.

Tokenizing HTML
• Should text in HTML commands not typically seen by the user be included as tokens?
  – Words appearing in URLs.
  – Words appearing in the "meta text" of images.
• The simplest approach is to exclude all HTML tag information (between "<" and ">") from tokenization.
• Some formats, such as .ps / .pdf, are more difficult to work with.

Stopwords
• Exclude high-frequency words (e.g. function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it").
• Stopwords are language-dependent.
• For efficiency, store the stopword strings in a hashtable so they can be recognized in constant time.
• How to determine a list of stopwords?
  – For English, existing lists of stopwords may be used (e.g. the WordNet stopword list).
  – For Spanish? Bulgarian?

Lemmatization
• Reduce inflectional/variant forms to the base form, e.g.:
  – am, are, is → be
  – car, cars, car's, cars' → car
  – the boy's cars are different colors → the boy car be different color
• How to do this?
  – Need a list of grammatical rules plus a list of irregular words:
    children → child, spoken → speak, …
  – Practical implementation: use WordNet's morphstr function.

Stemming
• Reduce tokens to the "root" form of words to recognize morphological variation.
  – "computer", "computational", "computation" are all reduced to the same token "compute".
• Correct morphological analysis is language-specific and can be complex.
• Stemming "blindly" strips off known affixes (prefixes and suffixes) in an iterative fashion:
  For example compressed and compression are both accepted as equivalent to compress.
  (the same sentence after stemming:) for exampl compres and compres are both accept as equival to compres.

Porter Stemmer
• Simple procedure for removing known affixes in English without using a dictionary.
• Can produce unusual stems that are not English words:
  – "computer", "computational", "computation" are all reduced to the same token "comput".
• May conflate (reduce to the same token) words that are actually distinct.
• Does not recognize all morphological derivations.
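The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline of any particular IR system: the stopword set is a tiny sample, and stem() applies only a handful of suffix rules rather than the full Porter algorithm.

```python
import re

# Tiny illustrative stopword set; a real system would use a full list
# (e.g. the WordNet stopword list). A hash-based set gives constant-time lookup.
STOPWORDS = {"a", "the", "in", "to", "of", "is", "are", "it", "i", "he", "she"}

def tokenize(text):
    # Simplest approach: case-insensitive, unbroken strings of alphabetic characters.
    return re.findall(r"[a-z]+", text.lower())

def stem(token):
    # Crude suffix stripping for illustration only (not the full Porter stemmer).
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ational", "ate"),
                                ("tional", "tion"), ("ed", ""), ("s", "")):
        if token.endswith(suffix):
            return token[:len(token) - len(suffix)] + replacement
    return token

def preprocess(text):
    # Tokenize, drop stopwords, then stem what remains.
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("The compressed files are in the archives"))
# -> ['compress', 'file', 'archive']
```

The remaining stemmed tokens are what would be passed on to the indexer described in the next sections.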
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion

Term-document incidence
Which plays/docs of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Columns (plays): Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth

  Antony      1  1  0  0  0  1
  Brutus      1  1  0  1  0  0
  Caesar      1  1  0  1  1  1
  Calpurnia   0  1  0  0  0  0
  Cleopatra   1  0  0  0  0  0
  mercy       1  0  1  1  1  1
  worser      1  0  1  1  1  0

(Entry is 1 if the doc contains the word, 0 otherwise.)
Query: Brutus AND Caesar but NOT Calpurnia.

Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them.
• 110100 AND 110111 AND 101111 = 100100, i.e. Antony and Cleopatra and Hamlet.

Bigger corpora
• Consider N = 1M documents, each with about 1K terms.
• Avg 6 bytes/term including spaces/punctuation → 6 GB of data in the documents.
• Say there are m = 500K distinct terms among these.

Assignment 1b
• The 500K x 1M matrix has half a trillion 0's and 1's.
• But it has no more than one billion 1's.
  – The matrix is extremely sparse.
• What's a better representation?
  – We only record the 1 positions.
• (Million = 10^6, Billion = 10^9, Trillion = 10^12)
Assignment 1b: Why?

Postings Data Structures
• Data about keywords, documents, and especially the occurrences of keywords in documents needs to be stored in an appropriate postings data structure: this is the index of an IR system.
• The details of what data needs to be stored depend on the required functionality of the application.
• The aim (as with all data structures) is to facilitate efficient access to and processing of the stored data.

So, what's wrong with this?
  Doc1: car, traffic, lorry, fruit, roads…
  Doc2: boat, river, traffic, vegetables…
  Doc3: train, bread, railways…
  …
  Doc1,000,000: car, roads, traffic, delays…
Inefficient retrieval! (a term must be searched for in every document)

Inverted index
• Indexing is a process by which a vocabulary of keywords is assigned to all documents of a corpus.
• For each vocabulary item there is a list of pointers to its occurrences.
• This speeds up searching for occurrences of words.

Inverted index
• Maintain a list of docs for each term.
• For each term T, we must store a list of all documents that contain T.
• Do we use an array or a list for this?

  Brutus    → 2, 4, 8, 16, 32, 64, 128
  Calpurnia → 1, 2, 3, 5, 8, 13, 21, 34
  Caesar    → 13, 16

• What happens if the word Caesar is added to document 14?

Inverted index
• Linked lists are generally preferred to arrays:
  + Dynamic space allocation
  + Easy insertion of terms into postings lists
  − Space overhead of pointers

  Dictionary      Postings lists
  Brutus    → 2 → 4 → 8 → 16 → 32 → 64 → 128
  Calpurnia → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
  Caesar    → 13 → 16

Inverted index construction
  Documents to be indexed:  Friends, Muslims, countrymen.
        ↓ Tokenizer
  Token stream:             Friends  Muslims  Countrymen
        ↓ Linguistic modules
  Modified tokens:          friend  muslim  countryman
        ↓ Indexer
  Inverted index:
        friend     → 2 → 4
        muslim     → 1 → 2
        countryman → 13 → 16
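As a rough illustration of this construction pipeline, here is a minimal Python sketch (the names are mine, not part of any library). It assumes the documents have already been tokenized and normalized, and it stores only document IDs per term, i.e. a stripped-down version of the Dictionary + Postings structure built step by step in the indexer example that follows.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs: {doc_id: list of already-preprocessed tokens}
    # Returns {term: sorted list of doc IDs containing the term} (its postings list).
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    # Merge two sorted postings lists: the Boolean AND of two terms.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {
    1: "i did enact julius caesar i was killed i the capitol brutus killed me".split(),
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious".split(),
}
index = build_inverted_index(docs)
print(index["brutus"], index["caesar"])                 # [1, 2] [1, 2]
print(intersect(index["brutus"], index["caesar"]))      # brutus AND caesar -> [1, 2]
print(intersect(index["caesar"], index["ambitious"]))   # caesar AND ambitious -> [2]
```

Because each postings list is kept sorted, the AND of two terms is a single linear merge rather than a scan over every document.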
Indexer steps
• Start from the sequence of (Modified token, Document ID) pairs.

  Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
  Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

  Doc 1 pairs: (I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1)
  Doc 2 pairs: (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

• Sort by terms:

  (ambitious,2) (be,2) (brutus,1) (brutus,2) (caesar,1) (caesar,2) (caesar,2) (capitol,1) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)

• Multiple term entries in a single document are merged, and frequency information is added, giving (term, doc #, term freq) entries:

  (ambitious,2,1) (be,2,1) (brutus,1,1) (brutus,2,1) (caesar,1,1) (caesar,2,2) (capitol,1,1) (did,1,1) (enact,1,1) (hath,2,1) (I,1,2) (i',1,1) (it,2,1) (julius,1,1) (killed,1,2) (let,2,1) (me,1,1) (noble,2,1) (so,2,1) (the,1,1) (the,2,1) (told,2,1) (was,1,1) (was,2,1) (with,2,1) (you,2,1)

• The result is split into a Dictionary file and a Postings file, so there is exactly one entry for each term in the Dictionary.

  Dictionary (term : N docs, collection freq):
  ambitious 1,1 | be 1,1 | brutus 2,2 | caesar 2,3 | capitol 1,1 | did 1,1 | enact 1,1 | hath 1,1 | I 1,2 | i' 1,1 | it 1,1 | julius 1,1 | killed 1,2 | let 1,1 | me 1,1 | noble 1,1 | so 1,1 | the 2,2 | told 1,1 | was 2,2 | with 1,1 | you 1,1

  Postings (doc # : freq), one list per Dictionary term:
  ambitious → 2:1 | be → 2:1 | brutus → 1:1, 2:1 | caesar → 1:1, 2:2 | capitol → 1:1 | did → 1:1 | enact → 1:1 | hath → 2:1 | I → 1:2 | i' → 1:1 | it → 2:1 | julius → 1:1 | killed → 1:2 | let → 2:1 | me → 1:1 | noble → 2:1 | so → 2:1 | the → 1:1, 2:1 | told → 2:1 | was → 1:1, 2:1 | with → 2:1 | you → 2:1

• Each Dictionary entry holds a pointer into the Postings file, linking the term to its postings list.

How Good is an IR System?
• We need ways to measure how good an IR system is, i.e. evaluation metrics.
• Systems should return relevant information items (texts, images, etc.); systems may rank the items in order of relevance.

Two ways to measure the performance of an IR system:
• Precision = "how many of the retrieved items are relevant?"
• Recall = "how many of the items that should have been retrieved were retrieved?"
• These should be objective measures.
• Both require humans to make decisions about which documents are relevant for a given query.

Calculating Precision and Recall
• R = number of documents in the collection relevant to topic t
• A(t) = number of documents returned by the system in response to query t
• C = number of 'correct' (relevant) documents returned, i.e. the intersection of R and A(t)

  PRECISION = ((C + 1) / (A(t) + 1)) * 100
  RECALL    = ((C + 1) / (R + 1)) * 100
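A small Python sketch of these two metrics, using the formulas exactly as given above (including the +1 terms, which keep the denominators non-zero). The function names and the example numbers are illustrative assumptions, not taken from the assignment below.

```python
def precision(correct, returned):
    # PRECISION = ((C + 1) / (A(t) + 1)) * 100
    return (correct + 1) / (returned + 1) * 100

def recall(correct, relevant_in_collection):
    # RECALL = ((C + 1) / (R + 1)) * 100
    return (correct + 1) / (relevant_in_collection + 1) * 100

# Hypothetical query: 100 relevant documents exist in the collection;
# the system returns 20 documents, 10 of which are relevant.
print(round(precision(10, 20), 1))   # 52.4
print(round(recall(10, 100), 1))     # 10.9
```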
Assignment 1c
• Amanda and Alex each need to choose an information retrieval system. Amanda works for an intelligence agency, so getting all possible information about a topic is important for the users of her system. Alex works for a newspaper, so getting some relevant information quickly is more important for the journalists using his system.
• See below for statistics for two information retrieval systems (Search4Facts and InfoULike) when they were used to retrieve documents from the same document collection in response to the same query: there were 100,000 documents in the collection, of which 50 were relevant to the given query.
• Which system would you advise Amanda to choose, and which would you advise Alex to choose? Your decisions should be based on the evaluation metrics of precision and recall.

Search4Facts
• Number of Relevant Documents Returned = 12
• Total Number of Documents Returned = 15

InfoULike
• Number of Relevant Documents Returned = 48
• Total Number of Documents Returned = 295