IST 511 Information Management: Information and Technology
Text Processing and Information Retrieval
Dr. C. Lee Giles
David Reese Professor, College of Information Sciences and Technology
Professor of Computer Science and Engineering
Professor of Supply Chain and Information Systems
The Pennsylvania State University, University Park, PA, USA
giles@ist.psu.edu
http://clgiles.ist.psu.edu
Special thanks to C. Manning, P. Raghavan, H. Schutze, M. Hearst

Last Time
What are networks
– Definitions
– Theories
– Social networks
Why do we care
Impact on information science

Today
What is text
– Definitions
– Why important
– What is text processing
What is information retrieval?
– Definition of a document
– Text and documents as vectors
– Relevance (performance) – recall/precision
– Inverted index
– Similarity/ranking
– ACM SIGIR (list serve)
Impact on information science
– Foundations of SEARCH ENGINES

Tomorrow
Topics used in IST
Machine learning
Linked information and search
Encryption
Probabilistic reasoning
Digital libraries
Others?

Theories in Information Sciences
Enumerate some of these theories in this course.
Issues:
– Unified theory?
– Domain of applicability
– Conflicts
Theories here are
– Mostly algorithmic
– Some quite qualitative (qualitative vs algorithmic?)
Quality of theories
– Occam's razor
– Subsumption of other theories (any subsumptions so far?)
Theories of text and information retrieval
– Theories are mostly representational

What is information retrieval (IR)?
Automatic document retrieval, usually based on a query
Involves:
• Documents
• Document parsing
• Vector model (theory!)
• Inverted index
• Query processing
• Similarity/Ranking
• Evaluation

History of Information Retrieval
Ancient libraries to the web
Always a part of information sciences

Early ideas of IR/search
Vannevar Bush – Memex – 1945
"A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
Bush seems to understand that computers won't just store information as a product; they will transform the process people use to produce and use information.

Some IR History
– Roots in the scientific "Information Explosion" following WWII
– Interest in computer-based IR from the mid 1950s
• Key word indexing, H.P. Luhn at IBM (1958)
• Probabilistic models at Rand (Maron & Kuhns) (1960)
• Boolean system development at Lockheed ('60s)
• MEDLARS ('60s)
• Vector Space Model (Salton at Cornell, 1965)
• Statistical weighting methods and theoretical advances ('70s)
• Lexis Nexis ('70s)
• Refinements and advances in application ('80s)
• User interfaces, large-scale testing and application ('90s)
– Then came the web and search engines and everything changed

IR and Search Engines
Search engines are an IR application (instantiation).
Search engines have become the most popular IR tools. Why?

[Figure: A Typical Web Search Engine – Users ↔ Interface ↔ Query Engine ↔ Index; a Crawler fetches the Web and an Indexer builds the Index]

[Figure: A Typical Information Retrieval System – Users ↔ Interface ↔ Query Engine ↔ Index; an Indexer (with tokenization) builds the Index from the Document Collection]
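To make the component diagram above concrete, here is a minimal sketch (not the course's code) of how those pieces fit together. All class and method names are illustrative assumptions, not a real library.

```python
# A toy wiring of the "Typical Information Retrieval System" components:
# a document collection, an Indexer that builds an index, and a Query
# Engine consulted through a simple interface.

class Indexer:
    """Builds an index (term -> set of doc ids) from a document collection."""
    def build(self, collection):
        index = {}
        for doc_id, text in collection.items():
            for term in text.lower().split():
                index.setdefault(term, set()).add(doc_id)
        return index

class QueryEngine:
    """Looks up query terms in the index and returns matching doc ids."""
    def __init__(self, index):
        self.index = index
    def search(self, query):
        results = [self.index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*results) if results else set()

# Interface layer: the user types a query, the engine consults the index.
collection = {1: "Friends Romans countrymen", 2: "Romans go home"}
engine = QueryEngine(Indexer().build(collection))
print(engine.search("Romans"))   # {1, 2}
```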
What is Text?
Text is so common that we often ignore its importance
What is text?
– Strings of characters (alphabets, ideograms, ASCII, Unicode, etc.)
• Words
• . , : ; - ( ) _
• 1 2 3 3.1415 1010
• f = ma, H2O
• Tables
• Figures
– Anything that is not an image, etc.
Why is text important?
• Text is language capture – an instantiation of language, culture, science, etc.

Corpora?
A collection of texts, especially if complete and self-contained; the corpus of Anglo-Saxon verse
In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database (The Oxford Companion to the English Language, 1992)
A collection of naturally occurring language text chosen to characterize a state or variety of a language (John Sinclair, Corpus Concordance Collocation, OUP 1991)
Types:
– Written vs Spoken
– General vs Specialized
• e.g. ESP, Learner corpora
– Monolingual vs Multilingual
• e.g. Parallel, Comparable
– Synchronic vs Diachronic
• At a particular point in time or over time
– Annotated vs Unannotated

Written corpora
                      Brown                          LOB
Time of compilation   1960s                          1970s
Compiled at           Brown University (US)          Lancaster, Oslo, Bergen
Language variety      Written American English       Written British English
Size                  1 million words (500 texts of 2,000 words each) for each corpus
Design                Balanced corpora; 15 genres of text, incl. press reportage, editorials, reviews, religion, government documents, reports, biographies, scientific writing, fiction

Focus on documents
• Fundamental unit to be indexed, queried and retrieved.
• Deciding what constitutes an individual document can vary depending on the problem.
• Documents are units usually consisting of text.
• Text is turned into tokens or terms.
• Terms or tokens are words or roots of words, semantic units or phrases which are the atoms of indexing.
• Repositories (databases) are collections of documents.
• A query is a request for documents on a query-related topic.

Definitions
Collections consist of documents
Document
– The basic unit which we will automatically index
• usually a body of text which is a sequence of terms
– has to be digital
Tokens or terms
– Basic units of a document, usually consisting of text
• semantic word or phrase, numbers, dates, etc.
Collections or repositories
– particular collections of documents
– sometimes called a database
Query
– request for documents on a topic

Collection vs documents vs terms
Collection ⊃ Document ⊃ Terms or tokens

What is a Document?
A document is a digital object with an operational definition
– Much of what you deal with daily!
– Indexable
– Can be queried and retrieved.
Many types of documents
– Text or part of text
– Image
– Audio
– Video
– Data
– Email
– Etc.

Text Processing & Tokenization
Standard steps:
• Recognize document structure
• titles, sections, paragraphs, etc.
• Tokenization
• Break into tokens – type of markup
• Tokens are delimited text; ex:
– Hello, how are you.
– _hello_,_how_are_you_._
• usually space and punctuation delimited
• special issues with Asian languages
– Stemming/morphological analysis
– Store in inverted index

Basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → Token stream: Friends Romans Countrymen
→ Linguistic modules → Modified tokens: friend roman countryman
→ Indexer → Inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16

Parsing a document
What format is it in?
– pdf/word/excel/html?
What language is it in?
What character set is in use?
Each of these is a classification problem which can be solved using heuristics or ML methods, as sketched below.
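A minimal sketch of such heuristics, assuming nothing about the course's tools: the magic-byte signatures and stopword lists below are small illustrative samples, not a production classifier.

```python
def guess_format(raw: bytes) -> str:
    """Guess a document's format from its leading bytes (magic numbers)."""
    if raw.startswith(b"%PDF"):
        return "pdf"
    if raw.startswith(b"PK\x03\x04"):          # zip container: docx/xlsx/epub...
        return "office/zip"
    if raw.lstrip()[:5].lower() == b"<html" or b"<!doctype html" in raw[:100].lower():
        return "html"
    return "plain text"

def guess_language(text: str) -> str:
    """Guess the language by counting a few very frequent function words."""
    stopwords = {
        "en": {"the", "of", "and", "to", "in"},
        "fr": {"le", "la", "de", "et", "les"},
        "de": {"der", "die", "und", "das", "ist"},
    }
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in stopwords.items()}
    return max(scores, key=scores.get)

print(guess_format(b"%PDF-1.7 ..."))                       # pdf
print(guess_language("the history of the web and of IR"))  # en
```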
But there are complications …

Format/language stripping
Documents being indexed can include docs from many different languages
– A single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple languages/formats
– French email with a Portuguese pdf attachment.
What is a unit document?
– An email?
– With attachments?
– An email with a zip containing documents?

Tokenization
Parsing (chopping up) the document into basic units that are candidates for later indexing
What to use and what not
Issues with
– Punctuation
– Numbers
– Special characters
– Equations
– Formulas
– Languages
– Normalization (often by stemming)

Tokenization
Input: "Friends, Romans and Countrymen"
Output: Tokens
– Friends
– Romans
– Countrymen
Each such token is now a candidate for an index entry, after further processing
But what are valid tokens to emit?

Tokenization
Issues in tokenization:
– Finland's capital → Finland? Finlands? Finland's?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• State-of-the-art: break up hyphenated sequence.
• co-education?
• the hold-him-back-and-drag-him-away-maneuver?
– San Francisco: one token or two? How do you decide it is one token?

Numbers
3/12/91    Mar. 12, 1991
55 B.C.
B-52
My PGP key is 324a3df234cb23e
100.2.86.144
– Generally, don't index as text.
– Will often index "meta-data" separately
• Creation date, format, etc.

Tokenization: language issues
French: L'ensemble – one token or two?
– L? L'? Le?
– Want ensemble to match with un ensemble
German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– 'life insurance company employee'

Tokenization: language issues
Chinese and Japanese have no spaces between words:
– Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple alphabets intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(the string mixes Katakana, Hiragana, Kanji and "Romaji")
End-user can express a query entirely in hiragana!

Tokenization: language issues
Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
Words are separated, but letter forms within a word form complex ligatures
استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
'Algeria achieved its independence in 1962 after 132 years of French occupation.'
(the reading order mixes right-to-left text with left-to-right numbers)
With Unicode, the surface presentation is complex, but the stored form is straightforward

Normalization
Need to "normalize" terms in indexed text as well as query terms into the same form
– We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms
– e.g., by deleting periods in a term
Alternative is to do limited expansion:
– Enter: window    Search: window, windows
– Enter: windows   Search: Windows, windows
– Enter: Windows   Search: Windows
Potentially more powerful, but less efficient
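A minimal sketch of the tokenization and normalization steps above. The rules (case folding, deleting periods so U.S.A. matches USA, splitting hyphenated forms, dropping a possessive) are deliberate simplifications of the issues just discussed, not a complete or language-aware tokenizer.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split on whitespace/punctuation, keeping periods, apostrophes and hyphens for now."""
    return re.findall(r"[\w.'-]+", text)

def normalize(token: str) -> str:
    token = token.lower()                 # case folding
    token = token.replace(".", "")        # U.S.A. -> usa
    token = re.sub(r"'s$", "", token)     # Finland's -> finland (crude)
    return token

def analyze(text: str) -> list[str]:
    terms = []
    for tok in tokenize(text):
        for part in tok.split("-"):       # Hewlett-Packard -> hewlett, packard
            if part:
                terms.append(normalize(part))
    return terms

print(analyze("Hewlett-Packard bought Finland's U.S.A. offices"))
# ['hewlett', 'packard', 'bought', 'finland', 'usa', 'offices']
```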
Linguistic Modules
Stemming and Morphological Analysis
Goal: "normalize" similar words
Morphology ("form" of words)
– Inflectional Morphology
• e.g., inflect verb endings and noun number
• Never changes grammatical class
– dog, dogs
– Derivational Morphology
• Derives one word from another
• Often changes grammatical class
– build, building; health, healthy

Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
the boy's cars are different colors → the boy car be different color
Lemmatization implies doing "proper" reduction to dictionary headword form

Documents / bag of words
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation.
Take out the sequence and one has a "bag of words".
The individual words and other symbols are known as tokens or terms.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.

Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics.
These statistics:
• influence the effectiveness and efficiency of data structures used to index documents
• many retrieval models rely on them
Treat each document as a bag of words

Word Frequency Example
The following example is taken from: Jamie Callan, Characteristics of Text, 1997
Sample of 19 million words
The 50 commonest words in rank order (r), with their frequency (f):
the 1,130,021; of 547,311; to 516,635; a 464,736; in 390,819; and 387,703; that 204,351; for 199,340; is 152,483; said 148,302; it 134,323; on 121,173; by 118,863; as 109,135; at 101,779; mr 101,679; with 101,210;
from 96,900; he 94,585; million 93,515; year 90,104; its 86,774; be 85,588; was 83,398; company 83,070; an 76,974; has 74,405; are 74,097; have 73,132; but 71,887; will 71,494; say 66,807; new 64,456; share 63,925;
or 54,958; about 53,713; market 52,110; they 51,359; this 50,933; would 50,828; you 49,281; which 48,273; bank 47,940; stock 47,401; trade 47,310; his 47,116; more 46,244; who 42,142; one 41,635; their 40,910

Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the frequency that w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
[Figure: Zipf distribution of rank r vs frequency f, shown on linear and log scales]

Zipf's Law
If the words w in a collection are ranked r by their frequency f, they roughly fit the relation:
r * f = c, or r = c * 1/f
Different collections have different constants c.
In English text, c tends to be about n / 10, where n is the number of word occurrences in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see:
Zipf, G. K., Human Behaviour and the Principle of Least Effort. Addison-Wesley, 1949

Methods that Build on Zipf's Law
The important points:
– a few elements occur very frequently
– a medium number of elements have medium frequency
– many elements occur very infrequently
Stop lists: Ignore the most frequent words (upper cut-off). Used by almost all systems.
Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off). Rarely used.
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less. Used by almost all ranking methods.
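A quick sketch of checking Zipf's law on any text: count word frequencies, rank them, and print r * f, which should stay roughly constant (about n/10 for English). Illustrative only; the file path is a placeholder.

```python
from collections import Counter
import re

def zipf_table(text: str, top: int = 10) -> None:
    words = re.findall(r"[a-z]+", text.lower())
    n = len(words)
    counts = Counter(words).most_common(top)
    print(f"{'rank':>4} {'word':>10} {'freq':>8} {'r*f':>10}   (n/10 = {n // 10})")
    for r, (word, f) in enumerate(counts, start=1):
        print(f"{r:>4} {word:>10} {f:>8} {r * f:>10}")

zipf_table(open("sample_collection.txt").read())
```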
Definitions
Corpus: A collection of documents that are indexed and searched together.
Word list: The set of all terms that are used in the index for a given corpus (also known as a vocabulary file). With full text searching, the word list is all the terms in the corpus, with stop words removed. Related terms may be combined by stemming.
Controlled vocabulary: A method of indexing where the word list is fixed. Terms from it are selected to describe each document.
Keywords: A name for the terms in the word list, particularly with a controlled vocabulary.

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually using computational methods).
Basic assumptions
– Collection: Fixed set of documents
– Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in "tables"
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data
Typically refers to free text
Allows
– Keyword queries including operators
– More sophisticated "concept" queries, e.g.,
• find all web pages dealing with drug abuse

[Figure: Classic model for searching text documents]

Database and IR: How are they different?
Structured data vs. unstructured data
Relevance managed by the user vs. relevance managed by the system (at least partially)
Relevance measure: binary vs. scaled
[Figure: Unstructured (text) vs. structured (database) data volumes in 2009]

Terms & tokens
Terms are what happens to text after tokenization and linguistic processing.
– Examples
• Computer -> computer
• knowledge -> knowledg
• The -> the
• Removal of stop words?

TERM VECTOR SPACE
Term vector space
n-dimensional space, where n is the number of different terms/tokens used to index a set of documents.
Vector
Document i, di, is represented by a vector. Its magnitude in dimension j is wij, where:
wij > 0 if term j occurs in document i
wij = 0 otherwise
wij is the weight of term j in document i.
[Figure: a document d1 represented in a 3-dimensional term vector space with axes t1, t2, t3 and components t11, t12, t13]

Basic Method: Incidence Matrix (Binary Weighting)
document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

      ant  bee  cat  dog  eel  fox  gnu  hog
d1     1    1    0    0    0    0    0    0
d2     1    1    0    1    0    0    0    1
d3     0    0    1    1    1    1    1    0
3 vectors in 8-dimensional term vector space
Weights: tij = 1 if document i contains term j and zero otherwise

Basic Vector Space Methods: Similarity between 2 documents
The similarity between two documents is a function of the angle between their vectors in the term vector space.
[Figure: vectors d1 and d2 in a term vector space with axes t1, t2, t3; the angle between them measures similarity]

Sec. 1.1
Example
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could search all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Doesn't support ranked retrieval (returning the best documents first)

Sec. 1.1
Term-document incidence matrix (vector model)
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony              1                    1              0           0        0         1
Brutus              1                    1              0           1        0         0
Caesar              1                    1              0           1        1         1
Calpurnia           0                    1              0           0        0         0
Cleopatra           1                    0              0           0        0         0
mercy               1                    0              1           1        1         1
worser              1                    0              1           1        1         0
1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia
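A minimal sketch of the incidence-matrix approach just described: one 0/1 vector per term, with the Boolean query answered by bitwise operations. Play names are abbreviated for space.

```python
plays = ["Antony&Cleo", "Julius Caesar", "Tempest", "Hamlet", "Othello", "Macbeth"]
incidence = {                      # one 0/1 vector per term, one column per play
    "brutus":    [1, 1, 0, 1, 0, 0],
    "caesar":    [1, 1, 0, 1, 1, 1],
    "calpurnia": [0, 1, 0, 0, 0, 0],
}

# Brutus AND Caesar AND NOT Calpurnia, column by column
answer = [b & c & (1 - p) for b, c, p in
          zip(incidence["brutus"], incidence["caesar"], incidence["calpurnia"])]

print(answer)                                            # [1, 0, 0, 1, 0, 0]
print([play for play, hit in zip(plays, answer) if hit]) # ['Antony&Cleo', 'Hamlet']
```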
Sec. 1.1
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
110100 AND 110111 AND 101111 = 100100

[Figure (repeated): A Typical Information Retrieval System]

How good are the retrieved documents?
Use relevance as a measure
How relevant is the document retrieved
– for the user's information need.
Subjective, but assume it's measurable
Measurable to some extent
– How often do people agree a document is relevant to a query
• More often than expected
How well does it answer the question?
– Complete answer? Partial?
– Background information?
– Hints for further exploration?

How do we measure relevance?
Measures:
– Binary measure
• 1 relevant
• 0 not relevant
– N-ary measure
• 3 very relevant
• 2 relevant
• 1 barely relevant
• 0 not relevant
– Negative values? N = ?
consistency vs. expressiveness tradeoff

Sec. 1.1
Precision and recall
Defined by relevance of the documents
Precision: Fraction of retrieved docs that are relevant to the user's information need
Recall: Fraction of relevant docs in the collection that are retrieved

Precision and Recall
                  Relevant   Not Relevant
Retrieved         Hit        False Alarm
Not Retrieved     Miss       Correct Rejection

Precision and Recall (Cont'd)
                  Relevant   Not Relevant
Retrieved         a          b
Not Retrieved     c          d
P = a / (a + b) = Hits / (Hits + False Alarms)
R = a / (a + c) = Hits / (Hits + Misses)

Precision and Recall (Cont'd)
                  Relevant   Not Relevant   (total)
Retrieved         4          6              10
Not Retrieved     1          89             90
(total)           5          95             100
P = 4 / (4 + 6) = 0.4
R = 4 / (4 + 1) = 0.8

Relevance
• An information system should give the user relevant information
• However, determining exactly what document is relevant can get complex
– It depends on the content
• Is it on topic?
– It depends on the task
• Is it trustworthy?
– It depends on the user
• What have they already looked at?
• What languages do they understand?
• Is relevance all-or-none? Or, can there be partial relevance?
• Also good for evaluation of IR systems

How do we Measure Relevance?
• In order to evaluate the performance of information retrieval systems, we need to measure relevance.
• Usually the best we can do is to have an expert make ratings of the pertinence.
• User as expert or evaluator

Precision vs. Recall
Precision = |RelRetrieved| / |Retrieved|
Recall = |RelRetrieved| / |Rel in Collection|
[Figure: Venn diagram of all docs, the retrieved set and the relevant set]

Retrieved vs. Relevant Documents
[Figure series: Venn diagrams of the retrieved and relevant sets illustrating (a) very high precision, very low recall; (b) high recall, but low precision; (c) very low precision, very low recall (0 for both); (d) high precision, high recall (at last!)]
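A small sketch computing precision and recall for the worked example above (10 documents retrieved, 5 relevant documents in the collection, 4 of the retrieved ones relevant). The document id sets are made up to match those counts.

```python
retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
relevant  = {1, 2, 3, 4, 42}          # 42 is a relevant doc that was missed

rel_retrieved = retrieved & relevant
precision = len(rel_retrieved) / len(retrieved)   # 4 / 10 = 0.4
recall    = len(rel_retrieved) / len(relevant)    # 4 / 5  = 0.8
print(precision, recall)
```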
Sec. 1.1
How big can the Term-Document Matrix be?
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
– 6 GB of data in the documents.
Say there are M = 500K distinct terms among these.
The term-doc matrix will be 500K x 1M, i.e. half a trillion 0's and 1's.
But it has no more than one billion 1's.
– the matrix is extremely sparse.

[Figure (repeated): A Typical Information Retrieval System]

Sec. 1.2
Inverted index
For each term t, we must store a list of all documents that contain t.
– Identify each by a docID, a document serial number
We need variable-size postings lists
In memory, can use linked lists or variable length arrays
Dictionary → Postings (sorted by docID):
Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

Sec. 1.2
Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
→ Tokenizer → Token stream: Friends Romans Countrymen
→ Linguistic modules → Modified tokens: friend roman countryman
→ Indexer → Inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16

Sec. 1.2
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Sec. 1.2
Indexer steps: Sort
Sort by terms
– And then docID
Core indexing step

Sec. 1.2
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Doc. frequency information is added.

Sec. 1.3
Query processing: AND
Consider processing the query: Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– "Merge" the two postings:
Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
(the merge walks both sorted lists in step and outputs the common docIDs, here 2 and 8)
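A sketch of the indexing and AND-query steps just described: build a dictionary of postings lists sorted by docID, then intersect two postings lists with the linear "merge" walk. Toy collection, illustrative only.

```python
def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    index: dict[str, list[int]] = {}
    for doc_id in sorted(docs):                       # postings end up sorted by docID
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge two sorted postings lists (the AND operation)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

docs = {1: "i did enact julius caesar",
        2: "so let it be with caesar the noble brutus",
        4: "brutus and caesar were here"}
index = build_index(docs)
print(intersect(index["brutus"], index["caesar"]))   # [2, 4]
```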
Sec. 1.3
Boolean queries: Exact match
The Boolean retrieval model is being able to ask a query that is a Boolean expression:
– Boolean queries are queries using AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
Good for expert users with a precise understanding of their needs and the collection.
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
Most users don't want to wade through 1000s of results.

Ch. 6
Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
– AND gives too few; OR gives too many

Ranked retrieval models
Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query
Free text queries: Rather than a query language of operators and expressions, the user's query is just one or more words in a human language

Sec. 6.2
Binary term-document incidence matrix
(Same matrix as shown earlier.) Each document is represented by a binary vector ∈ {0,1}^|V|

Sec. 6.2
Term-document count matrices
Consider the number of occurrences of a term in a document:
– Each document is a count vector in ℕ^|V|: a column below
– This is called the bag of words model.
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony             157                   73              0          0        0         0
Brutus               4                  157              0          1        0         0
Caesar             232                  227              0          2        1         1
Calpurnia            0                   10              0          0        0         0
Cleopatra           57                    0              0          0        0         0
mercy                2                    0              3          5        5         1
worser               2                    0              1          1        1         0

Term frequency tf
The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
– But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

Sec. 6.2
Log-frequency weighting
The log frequency weight of term t in d is
w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
w_t,d = 0                   otherwise
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:
Score(q,d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)
The score is 0 if none of the query terms is present in the document.

Sec. 6.2.1
Document frequency
Rare terms are more informative than frequent terms
A document containing the term arachnocentric is very likely to be relevant to the query arachnocentric
→ We want a high weight for rare terms like arachnocentric.

Sec. 6.2.1
Document frequency, continued
Frequent terms are less informative than rare terms
– Consider a query term that is frequent in the collection (e.g., high, increase, line)
– A document containing such a term is more likely to be relevant than a document that doesn't
– But it's not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line
– But lower weights than for rare terms.
We will use document frequency (df) to capture this.

Sec. 6.2.1
idf weight
df_t is the document frequency of t: the number of documents that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N
We define the idf (inverse document frequency) of t by
idf_t = log10(N / df_t)
– We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.

Sec. 6.2.1
idf example, suppose N = 1 million
term        df_t        idf_t
calpurnia   1           6
animal      100         4
sunday      1,000       3
fly         10,000      2
under       100,000     1
the         1,000,000   0
There is one idf value for each term t in a collection.

Sec. 6.2.1
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:
Word        Collection frequency   Document frequency
insurance   10,440                 3,997
try         10,422                 8,760
Which word is a better search term (and should get a higher weight)?

Sec. 6.2.2
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight.
w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)
The weight increases with
1. the number of occurrences within a document
2. the rarity of the term in the collection

Sec. 6.2.2
Final ranking of documents for a query
Score(q,d) = Σ_{t ∈ q ∩ d} tf.idf_t,d

Sec. 6.3
Binary → count → weight matrix
            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony            5.25                  3.18             0          0        0        0.35
Brutus            1.21                  6.1              0          1        0        0
Caesar            8.59                  2.54             0          1.51     0.25     0
Calpurnia         0                     1.54             0          0        0        0
Cleopatra         2.85                  0                0          0        0        0
mercy             1.51                  0                1.9        0.12     5.25     0.88
worser            1.37                  0                0.11       4.15     0.25     1.95
Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
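A compact sketch of the tf-idf weighting just defined: w(t,d) = (1 + log10 tf) × log10(N / df). Toy numbers, illustrative only.

```python
import math

def tfidf(tf: int, df: int, N: int) -> float:
    if tf == 0:
        return 0.0
    return (1 + math.log10(tf)) * math.log10(N / df)

N = 1_000_000
print(tfidf(tf=3, df=100, N=N))        # rare-ish term: (1 + 0.477) * 4 ≈ 5.91
print(tfidf(tf=3, df=100_000, N=N))    # frequent term: (1 + 0.477) * 1 ≈ 1.48
print(tfidf(tf=0, df=100, N=N))        # term absent from the document: 0.0
```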
Sec. 6.3
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
These are very sparse vectors – most entries are zero.

[Figure (repeated): A Typical Information Retrieval System]

Sec. 6.3
Queries as vectors
Key idea 1: Do the same for queries: represent them as vectors in the space
Key idea 2: Rank documents according to their proximity to the query in this space

Sec. 6.3
How to measure similarity?
[Method 1] The Euclidean distance between q and d2 – but Euclidean distance is large for vectors of different lengths even when their term distributions are similar
[Method 2] Rank documents according to their angle with the query

Sec. 6.3
From angles to cosines
But how – and why – should we be computing cosines?

Sec. 6.3
From angles to cosines
The following two notions are equivalent:
– Rank documents in decreasing order of the angle between query and document
– Rank documents in increasing order of cosine(query, document)
Cosine is a monotonically decreasing function on the interval [0°, 180°]

Sec. 6.3
cosine(query, document)
cos(q,d) = (q · d) / (|q| |d|) = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) · sqrt(Σ_{i=1..|V|} d_i²) )
(the dot product of the unit vectors)
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or, equivalently, the cosine of the angle between q and d.

Summary – vector space ranking
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user

Methods for Selecting Weights
Empirical
Test a large number of possible weighting schemes with actual data. (Based on the work of Salton, et al.) What is often used today.
Model based
Develop a mathematical model of word distribution and derive the weighting scheme theoretically. (Probabilistic model of information retrieval.)

Discussion of Similarity
The cosine similarity measure is widely used and works well on a wide range of documents, but has no theoretical basis.
1. There are several possible measures other than the angle between vectors
2. There is a choice of possible definitions of tf and idf
3. With fielded searching, there are various ways to adjust the weight given to each field.

Importance of automated indexing
• Crucial to modern information retrieval and web search
• Vector model of documents and queries
– Inverted index
• Scalability issues crucial for large systems

Similarity Ranking Methods
Methods that look for matches (e.g., Boolean) assume that a document is either relevant to a query or not relevant.
Similarity ranking methods: measure the degree of similarity between a query and a document.
[Figure: a query compared against documents – how similar is each document to the request?]

Similarity Ranking Methods
[Figure: the query and the documents both feed the index database; a mechanism determines the similarity of the query to each document and returns the set of documents ranked by how similar they are to the query]

Organization of Files for Full Text Searching
[Figure: word list (terms ant, bee, cat, dog, elk, fox, gnu, hog with pointers to postings) → inverted lists (postings) → document store]

Indexing Subsystem
documents → assign document IDs → break into tokens → stop list* → stemming* → term weighting* → index database
(* indicates optional operation)

Search Subsystem
query → parse query → stop list* → stemming* → Boolean operations* → retrieved document set → ranking* → relevant, ranked document set
(uses the index database; * indicates optional operation)

What we covered
• Document parsing – text
• Zipf's law
• Documents as bags of words
• Tokens and terms
• Inverted index file structure
• Relevance and evaluation
• Precision/recall
• Documents and queries as vectors
• Similarity measures for ranking

Issues
What are the theoretical foundations for IR?
Where do documents come from?
How will this relate to search engines?
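To close, a minimal end-to-end sketch (not the course's reference code) of the vector space ranking summarized above: tf-idf vectors for documents and query, cosine similarity, and a ranked top-K result list. The toy documents and query are made up for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "caesar and brutus in rome",
    "d2": "brutus killed caesar",
    "d3": "romans go home",
}

def tfidf_vector(tokens, df, N):
    counts = Counter(tokens)
    return {t: (1 + math.log10(tf)) * math.log10(N / df[t]) for t, tf in counts.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)
df = Counter(t for toks in tokenized.values() for t in set(toks))   # document frequencies
doc_vecs = {d: tfidf_vector(toks, df, N) for d, toks in tokenized.items()}

query = "brutus caesar".split()
q_vec = tfidf_vector(query, df, N)
ranking = sorted(doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
print(ranking[:2])   # top K documents: ['d2', 'd1']
```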