Introduction to Information Retrieval

Definition
• Information retrieval (IR) is the task of
  – finding material (usually documents)
  – of an unstructured nature (usually text)
  – that satisfies an information need (usually expressed as a query)
  – from within large collections (usually stored on computers).

Structured vs. Unstructured
• Text has a natural (linguistic) structure
  – Sequential structure
  – Latent grammatical structure
  – Latent logical representations
  – Connections to knowledge bases
• When we say "unstructured data", we really mean that we're just going to ignore all of that structure.

A more detailed look at the task
Information retrieval can involve:
• Filtering: finding the set of relevant search results (Boolean retrieval)
• Organizing
  – Often ranking
  – Can be more complicated: clustering the results, classifying them into different categories, …
• User interface
  – Can be simple HTML (e.g., Google)
  – Can involve complex visualization techniques that let the user navigate the space of potentially relevant search results
• In general, the system tries to reduce the user's effort in finding exactly the information they are looking for.

Preliminaries: Terminology and Preprocessing

Terminology
• Collection (or corpus): a set of documents
• Word token (or just token): a sequence of characters in a document that constitutes a meaningful semantic unit
• Word type (or just type): an equivalence class of tokens with the exact same sequence of characters
• Index: a data structure used for efficient processing of large datasets
• Inverted index: the main data structure used in information retrieval
• Term: an equivalence class of word types, perhaps transformed, used to build an inverted index
• Vocabulary: the set of distinct terms in an index

Preprocessing a corpus
1. Tokenize: split a document into tokens
   – Mostly easy in English; hard in, e.g., Mandarin or Japanese
2. Morphological processing, which may include
   – Stemming / lemmatization (removing inconsequential endings of words)
   – Removing capitalization, punctuation, or diacritics and accents
   – Identifying highly synonymous terms, or "equivalence classes" (e.g., Jan. and January)
   – Removing very common word types ("stopwords")

Tokenizing English
• Easy rule (about 90% accurate): split on whitespace
• Some special cases:
  – Contractions: I'm → I + 'm, Jack's → Jack + 's, aren't → are + n't
  – Multi-word expressions: New York, Hong Kong
  – Hyphenation: "travel between San Francisco-Los Angeles" vs. co-occurrence or Hewlett-Packard

Tokenization in other languages
• Word segmentation: in East-Asian character-based writing systems (Mandarin, Thai, Korean, Japanese, …), word boundaries are not indicated by whitespace.
• In Germanic languages, words can often be compounded together to form very long words.
• Many languages have special forms of contractions.

A minimal tokenizer sketch follows below.
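To make the preprocessing steps concrete, here is a minimal tokenizer sketch in Python. It only illustrates the whitespace-splitting, lowercasing, and stopword-removal steps described above; the tiny stopword list is an illustrative assumption, not a complete treatment of the special cases.

```python
import re

# A tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "we", "no"}

def tokenize(text):
    """Split on whitespace, then strip surrounding punctuation and lowercase.
    A rough approximation of the 'easy rule' from the slides."""
    tokens = []
    for raw in text.split():
        tok = re.sub(r"^\W+|\W+$", "", raw).lower()  # drop leading/trailing punctuation
        if tok:
            tokens.append(tok)
    return tokens

def preprocess(text, remove_stopwords=True):
    """Tokenize and (optionally) drop very common word types."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

if __name__ == "__main__":
    print(preprocess("Johnny Appleseed planted apple seeds.", remove_stopwords=False))
    # ['johnny', 'appleseed', 'planted', 'apple', 'seeds']
```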
The Inverted Index

Indexing
• Indexing is a technique borrowed from databases.
• An index is a data structure that supports efficient lookups in a large data set
  – e.g., hash indexes, R-trees, B-trees, etc.

Inverted Index
An inverted index stores an entry for every term in the vocabulary, and a pointer to every document where that term is seen.

  Vocabulary   Postings List
  term1        Document17, Document45123
  ...          ...
  termN        Document991, Document123001

Example
Document D1: "yes we got no bananas"
Document D2: "Johnny Appleseed planted apple seeds."
Document D3: "we like to eat, eat, eat apples and bananas"

  Vocabulary   Postings List
  yes          D1
  we           D1, D3
  got          D1
  no           D1
  bananas      D1, D3
  Johnny       D2
  Appleseed    D2
  planted      D2
  apple        D2, D3
  seeds        D2
  like         D3
  to           D3
  eat          D3
  and          D3

Query "apples bananas":
  "apples"  → {D2, D3}
  "bananas" → {D1, D3}
The whole query gives the intersection: {D2, D3} ∩ {D1, D3} = {D3}

Variations
A word-level index stores document IDs and offsets for the positions of the words in each document.
• Why? It supports phrase-based queries.

  Vocabulary   Postings List
  yes          D1 (+0)
  we           D1 (+1), D3 (+0)
  got          D1 (+2)
  no           D1 (+3)
  bananas      D1 (+4), D3 (+8)
  …
  eat          D3 (+3, +4, +5)

Variations
• Real search engines add all kinds of other information to their postings lists
  – for efficiency
  – to support better ranking of results

  Vocabulary      Postings List
  yes (docs=1)    D1 (freq=1)
  we (docs=2)     D1 (freq=1), D3 (freq=1)
  …
  eat (docs=1)    D3 (freq=3)

Index Construction
Algorithm:
1. Scan through each document, word by word
   – Write a (term, docID) pair for each word to a TempIndex file
2. Sort the TempIndex by terms
3. Iterate through the sorted TempIndex: merge all entries for the same term into one postings list.

Step 1 on the example documents produces the TempIndex (here "apples" has been stemmed to "apple"):
  yes D1, we D1, got D1, no D1, bananas D1,
  Johnny D2, Appleseed D2, planted D2, apple D2, seeds D2,
  we D3, like D3, to D3, eat D3, eat D3, eat D3, apple D3, and D3, bananas D3

Step 2 sorts the TempIndex by term:
  and D3, apple D2, apple D3, Appleseed D2, bananas D1, bananas D3,
  eat D3, eat D3, eat D3, got D1, Johnny D2, like D3, no D1,
  planted D2, seeds D2, to D3, we D1, we D3, yes D1

Step 3 walks through the sorted pairs and merges entries term by term, so "apple D2, apple D3" becomes "apple → D2, D3", "bananas D1, bananas D3" becomes "bananas → D1, D3", and so on, yielding the final index:

  and        D3
  apple      D2, D3
  Appleseed  D2
  bananas    D1, D3
  eat        D3
  got        D1
  Johnny     D2
  like       D3
  no         D1
  planted    D2
  seeds      D2
  to         D3
  we         D1, D3
  yes        D1

A runnable sketch of this construction appears below.
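To make the three construction steps concrete, here is a minimal sort-based index builder in Python. It follows the scan / sort / merge outline above but keeps the TempIndex in memory rather than in a file; the tiny EQUIVALENCE map standing in for stemming is an illustrative assumption.

```python
from collections import defaultdict

# Tiny illustrative equivalence-class map (cf. the "Jan." / "January" example);
# a real system would use a stemmer or lemmatizer here.
EQUIVALENCE = {"apples": "apple"}

def normalize(raw):
    """Lowercase, strip surrounding punctuation, and apply equivalence classes."""
    tok = raw.strip(".,!?;:").lower()
    return EQUIVALENCE.get(tok, tok)

def build_inverted_index(docs):
    """docs: dict mapping docID -> text. Returns {term: list of docIDs}."""
    # Step 1: scan each document, writing a (term, docID) pair per word to a TempIndex.
    temp_index = []
    for doc_id, text in docs.items():
        for raw in text.split():
            temp_index.append((normalize(raw), doc_id))

    # Step 2: sort the TempIndex by term (ties broken by docID).
    temp_index.sort()

    # Step 3: merge all entries for the same term into one postings list.
    index = defaultdict(list)
    for term, doc_id in temp_index:
        postings = index[term]
        if not postings or postings[-1] != doc_id:  # collapse duplicates from repeated words
            postings.append(doc_id)
    return dict(index)

docs = {
    "D1": "yes we got no bananas",
    "D2": "Johnny Appleseed planted apple seeds.",
    "D3": "we like to eat, eat, eat apples and bananas",
}
index = build_inverted_index(docs)
print(index["apple"])    # ['D2', 'D3']
print(index["bananas"])  # ['D1', 'D3']
print(index["eat"])      # ['D3']
```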
Efficient Index Construction
Problem: indexes can be huge. How can we build them efficiently?
• Blocked Sort-Based Indexing (BSBI)
• Single-Pass In-Memory Indexing (SPIMI)
What's the difference?

Vector Space Model

Problem: Too many matching results for every query
Using an inverted index is all fine and good, but if your document collection has 10^12 documents and someone searches for "banana", they'll get 90 million results. We need to be able to return the "most relevant" results. We need to rank the results.

Vector Space Model
Idea: treat each document and query as a vector in a vector space. Then we can find the "most relevant" documents by finding the "closest" vectors.
• But how can we make a document into a vector?
• And how do we measure "closest"?

Example: Documents as Vectors
Document D1: "yes we got no bananas"
Document D2: "what you got"
Document D3: "yes I like what you got"

              yes  we  got  no  bananas  what  you  I  like
  Vector V1:   1    1   1    1     1       0     0   0   0
  Vector V2:   0    0   1    0     0       1     1   0   0
  Vector V3:   1    0   1    0     0       1     1   1   1

What about queries?
The vector space model treats queries as (very short) documents.
Example query: "you got"

              yes  we  got  no  bananas  what  you  I  like
  Query Q1:    0    0   1    0     0       0     1   0   0

Measuring Similarity
Similarity metric: the cosine of the angle between document vectors ("cosine similarity"):

  CS(v1, v2) = cos(θ) = (v1 · v2) / (‖v1‖ ‖v2‖)

Why cosine?
It gives some intuitive similarity judgments:
  cos(0)               =  1
  cos(π/4) = cos(45°)  ≈  0.71
  cos(π/2) = cos(90°)  =  0
  cos(π)   = cos(180°) = -1

Example: Computing relevance
Using Q1 and V1–V3 from above:

  CS(Q1, V1) = (Q1 · V1) / (‖Q1‖ ‖V1‖) = 1 / (√2 · √5) ≈ 0.32
  CS(Q1, V2) = (Q1 · V2) / (‖Q1‖ ‖V2‖) = 2 / (√2 · √3) ≈ 0.82
  CS(Q1, V3) = (Q1 · V3) / (‖Q1‖ ‖V3‖) = 2 / (√2 · √6) ≈ 0.58

Relevance Ranking in the Vector Space Model
Definition: the relevance between query q and document d in the VSM is

  rel(q, d) = (v_q · v_d) / (‖v_q‖ ‖v_d‖)

For a given query, the VSM ranks documents from largest to smallest relevance score. (A small code sketch of this computation follows below.)
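As an illustration, here is a minimal cosine-similarity ranker in Python that reproduces the toy example above, using boolean term vectors for Q1 and V1–V3. It is a sketch of the scoring only, not of a full retrieval system.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def to_boolean_vector(text):
    """Boolean term vector: 1 if the term occurs, regardless of count."""
    return {t: 1.0 for t in text.lower().split()}

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
query = to_boolean_vector("you got")
scores = {d: cosine_similarity(query, to_boolean_vector(text)) for d, text in docs.items()}
for d, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(d, round(s, 3))
# D2 0.816, D3 0.577, D1 0.316 -- matching the hand computation above.
```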
All words are equal?
Our example so far has converted documents to boolean vectors, where each dimension indicates whether a term is present or not.
Problems:
• If a term appears many times in a document, it should probably count more.
• Some words are more "informative" than others (e.g., stopwords are not informative).
• Longer documents contain more terms, and will therefore be considered relevant to more queries, but perhaps not for good reason.

Weighting Scheme
• A weighting scheme is a technique for converting documents to real-valued vectors (rather than boolean vectors).
• The real value for a term in a document is called the weight of that term in that document.
  – We'll write this as w_{t,d}

Term Frequency Weighting
Term Frequency (TF) weighting is a heuristic that sets the weight of a term to the number of times it appears in the document:

  w_{t,d} = tf_{t,d} = count(t, d)

Log-TF
A common variant of TF reduces the effect of high-frequency terms by taking the logarithm:

  w_{t,d} = log-tf_{t,d} = 1 + log(count(t, d))   if count(t, d) ≥ 1
                         = 0                       if count(t, d) = 0

Inverse Collection Frequency Weighting
Inverse Collection Frequency (ICF) weighting is a heuristic that discounts the weight of a term according to the number of times it appears in the whole collection.
  – This makes really common words (e.g., "the") have really low weight.

  w_{t,d} = icf_t = log( T / count(t, C) )

where T is the total number of tokens in collection C.

Inverse Document Frequency Weighting
Inverse Document Frequency (IDF) weighting is a heuristic that discounts the weight of a term according to the number of documents it appears in across the whole collection.
  – This also makes really common words (e.g., "the") have really low weight.

  w_{t,d} = idf_t = log( N / |{d' : t ∈ d'}| )

where N is the total number of documents in collection C.

ICF vs. IDF

  Word        CF      DF     ICF      IDF
  try         10422   8760   0.3010   0.1632
  insurance   10440   3997   0.3007   0.5040

Nobody uses ICF!
  – Informative words tend to "clump together" in documents.
  – As a result, a word may appear many times, but only in a few documents.
  – This can make its CF large while its DF isn't as big.
  – In practice, IDF is better at discriminating between "informative" and "non-informative" terms.

Term Frequency – Inverse Document Frequency
TF-IDF weighting combines TF and IDF (duh):

  w_{t,d} = tf-idf_{t,d} = tf_{t,d} × idf_t = count(t, d) × log( N / |{d' : t ∈ d'}| )

It is probably the most common weighting scheme in practice. (A sketch of TF-IDF-weighted ranking follows below.)
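To tie the weighting schemes to the earlier cosine ranking, here is a small TF-IDF sketch in Python. It computes idf from document frequencies and builds weighted vectors as described above; the tiny corpus and the lack of any smoothing or further length normalization beyond the cosine are simplifying assumptions.

```python
import math
from collections import Counter

def compute_idf(docs):
    """idf_t = log(N / df_t), where df_t = number of documents containing t."""
    n_docs = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.lower().split()))
    return {t: math.log(n_docs / df_t) for t, df_t in df.items()}

def tfidf_vector(text, idf):
    """w_{t,d} = count(t, d) * idf_t; terms unseen in the collection get weight 0."""
    counts = Counter(text.lower().split())
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
idf = compute_idf(docs)
query_vec = tfidf_vector("you got", idf)
doc_vecs = {d: tfidf_vector(t, idf) for d, t in docs.items()}
ranking = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranking)  # documents ordered by TF-IDF cosine relevance to "you got"
```

Note that "got" occurs in every document here, so its idf is log(1) = 0 and it contributes nothing to the ranking, which is exactly the behaviour IDF is designed to produce for uninformative terms.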
Limitations
The vector space model has the following limitations:
• Long documents are poorly represented because they have poor similarity values.
• Semantic sensitivity: documents with similar context but different term vocabulary won't be associated, resulting in a false-negative match.
• The order in which the terms appear in the document is lost in the vector space representation.

Language Models for Ranking

VSM and heuristics
• One complaint about the VSM is that it relies too heavily on ad hoc heuristics:
  – Why should similarity be cosine similarity, as opposed to some other similarity function?
  – TF-IDF seems like a hack.
• Can we formulate a more principled approach that works well?

Language Models
• In theory classes, you've all seen models that "accept" or "reject" a string:
  – Finite automata (deterministic and non-deterministic)
  – Context-sensitive and context-free grammars
  – Turing machines
• Language models are probabilistic versions of these models.
  – Instead of saying "yes" or "no" to each string, they assign a probability of acceptance.

Language Model Definition
• Let V be the vocabulary (set of word types) for a language.
• A language model is a distribution P(·) over V*, the set of sequences made from V.

Example Language Models
• Really simple: if string s contains "the", P(s) = 1; otherwise P(s) = 0.*
• Slightly less simple: let

    p("the") = count("the", C) / T

  and if string s contains "the", let P(s) = p("the").
• More sophisticated:

    P(s) = ∏_{terms t ∈ s} p(t)

*This is not a proper distribution, but for ranking we're not going to be too picky.

Ranking with a Language Model
We can use a language model to rank documents for a given query, by determining P(d | q) and ranking according to the probability scores.

Query Likelihood
• Most (but not all) techniques apply Bayes' rule to this score:

    P(d | q) = P(q | d) P(d) / P(q)

• Since P(q) doesn't depend on the document, it's not interesting for ranking, so we can get rid of it.
• Most systems ignore P(d) as well (although there are good reasons for not ignoring it).
• What's left is called the query likelihood: P(q | d).

Language Models for Query Likelihood
To compute the query likelihood:
1. Systems automatically learn a language model P_{LM(d)} from each document d.
2. They then set P(q | d) = P_{LM(d)}(q).
Two questions:
1. What language model should we use?
2. How do we automatically learn it from d?

Multinomial Model
Multinomial models are the standard choice of language model for information retrieval.
[Figure: an urn of colored balls represents the language model; drawing balls generates a document.]
• The model generates one word at a time (with replacement), with probability equal to the fraction of balls of that word's color in the urn.
• The multinomial model throws out the sequence information: all that matters is the count of each word in the document.

Question: what's the probability of observing a document with these particular word frequencies, given our language model?
Answer: let p_w be the probability of word w according to the language model (the frequencies in the urn), and let c_w be the count of word w in the document. Then

    P_{LM}(d) = ( length(d)! / ∏_{w∈V} c_w! ) · ∏_{w∈V} (p_w)^{c_w}

Simplifications
• Starting with

    P_{LM}(d) = ( length(d)! / ∏_{w∈V} c_w! ) · ∏_{w∈V} (p_w)^{c_w}

• Since we're only going to compute P_{LM}(q), and q's length is a constant for a given query, we can simplify to

    P_{LM}(q) ∝ ∏_{w∈V} (p_w)^{c_w}

Learning a Multinomial Model
• To use the multinomial model for ranking, we need to be able to compute P_{LM(d)}(q).
  – That's the probability of query q, according to the language model we learned from document d.
• But what is P_{LM(d)}?
  – We need to determine all of the parameters of the LM.
  – For the multinomial model, this means we need to find p_w for each word w.

Maximum Likelihood Parameter Estimation
• To find good settings of p_w from document d, we use a technique called maximum likelihood estimation (MLE).
  – This means: set p_w to the value that makes P_{LM(d)}(d) biggest (most likely).
• It's possible to prove (on the board) that the MLE for p_w is

    p_w = c_w / Σ_{w'∈V} c_{w'}

Missing Words
• Let's say you're searching for "cheap digital cameras" and document d is all about this topic, but never mentions "digital" explicitly.
• What's P_{LM(d)}(q)?
  – p_{"digital"} = 0
  – therefore P_{LM(d)}(q) = 0
  – So this document will get ranked at the bottom.

Smoothing
• Mission: allow documents that are missing query words to have non-zero query likelihood.
• Method:
  1. Construct a "backoff" language model estimated from the whole corpus: P_{LM(C)}(·).
  2. Combine P_{LM(C)} and P_{LM(d)}:

    P'_{LM(d)}(q) = α P_{LM(d)}(q) + (1 − α) P_{LM(C)}(q)

(A small end-to-end sketch of smoothed query-likelihood ranking appears at the end of these notes.)

Experimental Results

Comparison between LM and VSM
• Advantages of the LM approach:
  – Experimentally, it seems to be more accurate.
  – Mathematically, it is more principled.
    • The multinomial model directly incorporates notions of TF and IDF.
• Disadvantages:
  – It's "harder" to incorporate some of the other techniques that are widely used in information retrieval (which we're not going to discuss):
    • Relevance feedback
    • Personalization
  – It can be computationally expensive.

Homework
• Gather a corpus of documents
  – Approximately 1 million tokens' worth of text
  – Divide it into distinct "documents"
  – Tokenize it
• Potential sources
  – Wikipedia
  – News sites or web portals (like Yahoo)
  – Collections of literature (see, for instance, the American National Corpus)
  – Web crawl
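As a capstone (and a possible starting point for experimenting with the homework corpus), here is a minimal sketch of smoothed query-likelihood ranking with a multinomial (unigram) model. It uses MLE document models, a corpus-wide backoff model, and per-word interpolation, which is one common reading of the smoothing formula above; the value α = 0.7 and the log-space scoring are illustrative choices, not prescribed by the notes.

```python
import math
from collections import Counter

def unigram_counts(text):
    return Counter(text.lower().split())

def query_likelihood_score(query, doc_text, collection_counts, collection_total, alpha=0.7):
    """log P'(q | d) with per-word interpolation:
       P'(w | d) = alpha * P_MLE(w | d) + (1 - alpha) * P_MLE(w | C)."""
    doc_counts = unigram_counts(doc_text)
    doc_total = sum(doc_counts.values())
    score = 0.0
    for w in query.lower().split():
        p_doc = doc_counts[w] / doc_total if doc_total else 0.0
        p_col = collection_counts[w] / collection_total if collection_total else 0.0
        p = alpha * p_doc + (1 - alpha) * p_col
        if p == 0.0:
            return float("-inf")   # query word unseen even in the whole collection
        score += math.log(p)       # work in log space to avoid underflow
    return score

docs = {
    "D1": "yes we got no bananas",
    "D2": "what you got",
    "D3": "yes I like what you got",
}
# Backoff model: MLE over the whole collection.
collection_counts = Counter()
for text in docs.values():
    collection_counts.update(unigram_counts(text))
collection_total = sum(collection_counts.values())

query = "you got"
ranked = sorted(
    docs,
    key=lambda d: query_likelihood_score(query, docs[d], collection_counts, collection_total),
    reverse=True,
)
print(ranked)  # documents ordered by smoothed query likelihood
```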