Information Retrieval and Vector Space Model
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Morteza Zihayat

Outline
- Introduction to IR
- IR System Architecture
- Vector Space Model (VSM)
- How to Assign Weights?
- TF-IDF Weighting
- Example
- Advantages and Disadvantages of the VS Model
- Improving the VS Model

Introduction to IR
"The world's total yearly production of unique information stored in the form of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage. This is the equivalent of 250 megabytes per person for each man, woman, and child on earth." (Lyman & Varian, 2000)

Growth of textual information
- Literature, WWW, news, email, desktop documents, blogs, intranets, ...
- How can we help manage and exploit all this information?

Information overflow
[Illustration omitted.]

What is Information Retrieval (IR)?
- Narrow sense: IR = search-engine technologies (IR = Google, library information systems); IR = text matching/classification.
- Broad sense: IR = text information management. The general problem is how to manage text information:
  - How to find useful information? (retrieval) Example: Google.
  - How to organize information? (text classification) Example: automatically assigning emails to different folders.
  - How to discover knowledge from text? (text mining) Example: discovering correlations between events.

Formalizing IR Tasks
- Vocabulary: V = {w1, w2, ..., wT}, the words of a language.
- Query: q = q1, q2, ..., qm, where qi ∈ V.
- Document: di = di1, di2, ..., di,mi, where dij ∈ V.
- Collection: C = {d1, d2, ..., dN}.
- Relevant document set: R(q) ⊆ C; generally unknown and user-dependent. The query only provides a "hint" about which documents should be in R(q).
- The IR task: find an approximation R'(q) of the relevant document set.
Source: this slide is borrowed from [1].

Evaluation measures
- The quality of many retrieval systems depends on how well they manage to rank relevant documents.
- How can we evaluate rankings in IR? IR researchers have developed evaluation measures specifically designed to evaluate rankings; most of them combine precision and recall in a way that takes the ranking into account.

Precision & Recall
- Precision is the percentage of items in the returned set that are relevant.
- Recall is the percentage of all relevant documents in the collection that appear in the returned set.
Source: this slide is borrowed from [1].
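As a concrete illustration of the two measures, here is a minimal Python sketch (not part of the original slides); the returned and relevant document ID sets are hypothetical.

```python
# Minimal sketch: precision and recall for an unranked returned set.
# The document IDs below are hypothetical.

def precision(returned: set, relevant: set) -> float:
    """Fraction of the returned documents that are relevant."""
    return len(returned & relevant) / len(returned) if returned else 0.0

def recall(returned: set, relevant: set) -> float:
    """Fraction of all relevant documents that were returned."""
    return len(returned & relevant) / len(relevant) if relevant else 0.0

returned = {"d1", "d3", "d5", "d7"}   # what the system returned
relevant = {"d1", "d2", "d3", "d9"}   # what the user actually wanted

print(precision(returned, relevant))  # 2 relevant out of 4 returned = 0.5
print(recall(returned, relevant))     # 2 of the 4 relevant docs found = 0.5
```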
Evaluating Retrieval Performance
[Figure omitted; source: this slide is borrowed from [1].]

IR System Architecture
[Diagram: documents pass through an INDEXING component to produce document representations; the user's query enters through the INTERFACE and is turned into a query representation; the SEARCHING and RANKING components match the two representations and return results to the user; the user's relevance judgments feed a FEEDBACK / QUERY MODIFICATION loop.]

Indexing
- Break documents into words.
- Apply a stop list.
- Apply stemming.
- Construct the index.

Searching
- Given a query, score documents efficiently.
- The basic question: given a query, how do we know whether document A is more relevant than document B?
  - Document A uses more query words than document B.
  - Word usage in document A is more similar to that in the query.
  - ...
- We need a way to compute the relevance between the query and the documents.

The Notion of Relevance
[Diagram: a taxonomy of retrieval models, organized by how relevance is estimated:
- Similarity between Rep(q) and Rep(d), with different representations and similarity measures: the vector space model (Salton et al., 75), which is today's lecture, the regression model (Fox, 83), and the probabilistic distribution model (Wong & Yao, 89).
- Probability of relevance, P(r=1 | q, d) with r ∈ {0, 1}: generative models based on document generation (the classical probabilistic model, Robertson & Sparck Jones, 76) or query generation (the language-modeling approach, Ponte & Croft, 98; Lafferty & Zhai, 01a).
- Probabilistic inference, P(d -> q) or P(q -> d), with different inference systems: the probabilistic concept space model (Wong & Yao, 95) and the inference network model (Turtle & Croft, 91).]

Relevance = Similarity
- Assumptions: the query and the document are represented in the same way; a query can be regarded as a "document"; Relevance(d, q) ≈ similarity(d, q).
- Retrieved set: R'(q) = {d ∈ C | f(d, q) > θ}, where f(q, d) = similarity(Rep(q), Rep(d)) and θ is a score threshold.
- Key issues: how to represent the query and the documents (the vector space model), and how to define the similarity measure f.

Vector Space Model (VSM)
- The vector space model is one of the most widely used models for ad-hoc retrieval.
- It is used in information filtering, information retrieval, indexing, and relevance ranking.

VSM
- Represent a document or query by a term vector.
  - Term: a basic concept, e.g., a word or a phrase.
  - Each term defines one dimension, so N terms define an N-dimensional space.
  - E.g., d = (x1, ..., xN), where xi is the "importance" of term i.
- Measure relevance by how close the query vector and the document vector are in the vector space.

VS Model: Illustration
[Diagram: documents D1-D11 and a query plotted in a three-dimensional term space whose axes are the terms "Starbucks", "Microsoft", and "Java"; documents whose vectors lie close to the query vector are the candidates for retrieval.]

Some Issues with the VS Model
- There is no consistent definition of the "basic concept" (term).
- How to assign weights to words is not determined by the model itself; the weight in a query indicates the importance of the term.
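Before turning to weighting, here is a minimal Python sketch (not from the slides) of the plain vector representation just described: documents and the query become raw term-count vectors over a shared vocabulary, and a dot product scores their closeness. The texts reuse two of the example documents that appear later in the deck.

```python
# Minimal sketch: documents and a query as raw term-count vectors,
# compared with a dot product. No weighting is applied yet.
from collections import Counter

docs = {
    "doc1": "information retrieval search engine information",
    "doc2": "travel information map travel",
}
query = "information retrieval"

# Shared vocabulary: every term seen in the collection or the query.
vocab = sorted({t for text in [*docs.values(), query] for t in text.split()})

def term_vector(text: str) -> list:
    """Raw term-frequency vector over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts[t] for t in vocab]

q_vec = term_vector(query)
for doc_id, text in docs.items():
    d_vec = term_vector(text)
    score = sum(q * d for q, d in zip(q_vec, d_vec))  # dot product
    print(doc_id, score)  # doc1: 2*1 + 1*1 = 3, doc2: 1*1 = 1
```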
How to Assign Weights?
- Different terms have different importance within a text.
- The term weighting scheme plays an important role in the similarity measure: a higher weight means a greater impact on the score.
- We now turn to the question of how to weight words in the vector space model.

A weighting scheme has three components:
- gi: the global weight of the i-th term,
- tij: the local weight of the i-th term in the j-th document,
- dj: the normalization factor for the j-th document.

TF Weighting
- Idea: a term is more important if it occurs more frequently in a document.
- Formulas (let f(t,d) be the frequency count of term t in document d):
  - Raw TF: TF(t,d) = f(t,d)
  - Log TF: TF(t,d) = log f(t,d)
  - Maximum-frequency normalization: TF(t,d) = 0.5 + 0.5 * f(t,d) / MaxFreq(d)
- Normalization of TF is very important!

TF Methods
[Table of TF weighting variants omitted.]

IDF Weighting
- Idea: a term is more discriminative if it occurs in fewer documents.
- Formula: IDF(t) = 1 + log(n/k), where n is the total number of documents and k is the number of documents containing term t (its document frequency).

IDF Weighting Methods
[Table of IDF weighting variants omitted.]

TF Normalization
- Why? Document lengths vary, and "repeated occurrences" of a term are less informative than the "first occurrence".
- Two views of document length: a document may be long because it uses more words, or because it has more content.
- Generally we penalize long documents, but we must avoid over-penalizing them.

TF-IDF Weighting
- TF-IDF weighting: weight(t,d) = TF(t,d) * IDF(t).
- Common in the document (high TF) gives a high weight; rare in the collection (high IDF) gives a high weight.
- Imagine a word-count profile of a document: what kind of terms would receive high weights?

How to Measure Similarity?
- Represent Di = (wi1, ..., wiN) and Q = (wq1, ..., wqN), with w = 0 if a term is absent.
- Dot-product similarity: SC(Q, Di) = sum_{j=1..N} wqj * wij
- Cosine similarity (a normalized dot product):
  sim(Q, Di) = (sum_{j=1..N} wqj * wij) / (sqrt(sum_{j=1..N} wqj^2) * sqrt(sum_{j=1..N} wij^2))
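Here is a minimal Python sketch (not from the slides) that puts the pieces together: raw TF, the IDF(t) = 1 + log(n/k) form given above, and both the dot-product and cosine scores. The three documents and the query below are hypothetical.

```python
# Minimal sketch: TF-IDF weighting with dot-product and cosine similarity.
# Uses raw TF and IDF(t) = 1 + log(n/k) as defined on the slides;
# the documents and query are hypothetical.
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "birds fly in the sky",
}
query = "cat dog"

n = len(docs)
vocab = sorted({t for text in docs.values() for t in text.split()})
df = {t: sum(t in text.split() for text in docs.values()) for t in vocab}
idf = {t: 1 + math.log(n / df[t]) for t in vocab}

def tfidf_vector(text: str) -> list:
    """weight(t,d) = TF(t,d) * IDF(t), 0 for absent terms."""
    tf = Counter(text.split())
    return [tf[t] * idf[t] for t in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

q_vec = tfidf_vector(query)
for doc_id, text in docs.items():
    d_vec = tfidf_vector(text)
    print(doc_id, round(dot(q_vec, d_vec), 3), round(cosine(q_vec, d_vec), 3))
# d2 scores highest: it contains both query terms, and "dog" is rare in the collection.
```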
VS Example: Raw TF, (Fake) IDF, and Dot Product
- Query q = "information retrieval"
- doc1 = "information retrieval search engine information"
- doc2 = "travel information map travel"
- doc3 = "government president congress"
- (Fake) IDF values: Info. 2.4, Retrieval 4.5, Travel 2.8, Map 3.3, Search 2.1, Engine 5.4, Govern. 2.2, President 3.2, Congress 4.3.
- Term vectors, written as raw TF (TF*IDF weight):
  - Doc1: Info. 2 (4.8), Retrieval 1 (4.5), Search 1 (2.1), Engine 1 (5.4)
  - Doc2: Info. 1 (2.4), Travel 2 (5.6), Map 1 (3.3)
  - Doc3: Govern. 1 (2.2), President 1 (3.2), Congress 1 (4.3)
  - Query: Info. 1 (2.4), Retrieval 1 (4.5)
- Sim(q, doc1) = 4.8*2.4 + 4.5*4.5 = 31.77
- Sim(q, doc2) = 2.4*2.4 = 5.76
- Sim(q, doc3) = 0

Example
- Query Q: "gold silver truck"
- D1: "Shipment of gold damaged in a fire"
- D2: "Delivery of silver arrived in a silver truck"
- D3: "Shipment of gold arrived in a truck"
- dfj is the document frequency of the j-th term; inverse document frequency idf = log10(n / dfj); TF*IDF is used as the term weight here.

Example (Cont'd)
Document frequencies and IDF values over the n = 3 documents:

Id  Term      df  idf
1   a         3   0
2   arrived   2   0.176
3   damaged   1   0.477
4   delivery  1   0.477
5   fire      1   0.477
6   gold      2   0.176
7   in        3   0
8   of        3   0
9   silver    1   0.477
10  shipment  2   0.176
11  truck     2   0.176

Example (Cont'd)
TF*IDF weight vectors (terms t1-t11 in the order above):

doc  t1  t2     t3     t4     t5     t6     t7  t8  t9     t10    t11
D1   0   0      0.477  0      0.477  0.176  0   0   0      0.176  0
D2   0   0.176  0      0.477  0      0      0   0   0.954  0      0.176
D3   0   0.176  0      0      0      0.176  0   0   0      0.176  0.176
Q    0   0      0      0      0      0.176  0   0   0.477  0      0.176

SC(Q, D1) = (0.176)(0.176) = 0.031 (only "gold" has a non-zero weight in both Q and D1)
SC(Q, D2) = (0.477)(0.954) + (0.176)(0.176) = 0.486
SC(Q, D3) = (0.176)(0.176) + (0.176)(0.176) = 0.062

The ranking is therefore D2, D3, D1. This SC uses the dot product.
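To make the arithmetic above easy to check, here is a minimal Python sketch (not part of the original slides) that recomputes the example end to end: raw TF, idf = log10(n/df), TF*IDF weights, and a dot-product score. Up to rounding, it reproduces the scores 0.031, 0.486, 0.062 and the ranking D2, D3, D1.

```python
# Minimal sketch: recomputing the gold/silver/truck example with
# raw TF, idf = log10(n/df), TF*IDF weights, and a dot-product score.
import math
from collections import Counter

docs = {
    "D1": "shipment of gold damaged in a fire",
    "D2": "delivery of silver arrived in a silver truck",
    "D3": "shipment of gold arrived in a truck",
}
query = "gold silver truck"

n = len(docs)
vocab = sorted({t for text in docs.values() for t in text.split()})
# df: number of documents containing the term; idf = log10(n / df)
df = {t: sum(t in set(text.split()) for text in docs.values()) for t in vocab}
idf = {t: math.log10(n / df[t]) for t in vocab}

def weights(text: str) -> dict:
    """TF*IDF weight for every vocabulary term (0 if the term is absent)."""
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in vocab}

q_weights = weights(query)
for doc_id, text in docs.items():
    d_weights = weights(text)
    score = sum(q_weights[t] * d_weights[t] for t in vocab)  # dot product
    print(doc_id, round(score, 3))
# Output: D1 0.031, D2 0.486, D3 0.062  ->  ranking D2, D3, D1
```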
Advantages of the VS Model
- Empirically effective (top TREC performance).
- Intuitive.
- Easy to implement.
- Well studied and extensively evaluated: the SMART system, developed at Cornell from 1960 to 1999, is still widely used.
- Warning: there are many variants of TF-IDF!

Disadvantages of the VS Model
- Assumes term independence.
- Assumes that queries and documents are represented in the same way.
- Requires a lot of parameter tuning.

Improving the VS Model
We can improve the model by:
- reducing the number of dimensions:
  - eliminating all stop words and very common terms,
  - stemming terms to their roots,
  - Latent Semantic Analysis;
- not retrieving documents whose cosine score falls below a defined threshold;
- using normalized frequencies: the normalized frequency of a term i in document j, and the corresponding normalized document and query frequencies, replace the raw counts (formulas omitted). [1]

Stop List
- Function words do not carry useful information for IR: of, not, to, or, in, about, with, I, be, ...
- A stop list contains stop words that are not to be used as index terms: prepositions, articles, pronouns, some adverbs and adjectives, and some very frequent words (e.g., "document").
- The removal of stop words usually improves IR effectiveness.
- A few "standard" stop lists are commonly used.

Stemming
- Reason: different word forms may carry similar meaning (e.g., "search", "searching"); we create a "standard" representation for them.
- Stemming: removing some endings of words, e.g., dancer, dancers, dance, danced, dancing -> dance.

Stemming (Cont'd)
Two main methods:
- Linguistic/dictionary-based stemming: high stemming accuracy and higher coverage, but high implementation and processing costs.
- Porter-style stemming: lower stemming accuracy and lower coverage, but lower implementation and processing costs; usually sufficient for IR.

Latent Semantic Indexing (LSI) [3]
- Reduces the dimensionality of the term-document space.
- Attempts to address synonymy and polysemy.
- Uses Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
- Based on the principle that words used in the same contexts tend to have similar meanings.

LSI Process
In general, the process involves:
- constructing a weighted term-document matrix,
- performing a Singular Value Decomposition on that matrix,
- using the resulting matrices to identify the concepts contained in the text.
LSI statistically analyses the patterns of word usage across the entire document collection.

References
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval.
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/2.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/ir4up.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/e09-3009.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/07models-vsm.pdf
- https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/03vectorspaceimplementation-6per.pdf
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture02.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/vector_space_modelupdated.ppt
- https://wiki.cse.yorku.ca/course_archive/2011-12/F/6339/_media/lecture_13_ir_and_vsm_.ppt
- Document Classification based on Wikipedia Content, http://www.iicm.tugraz.at/cguetl/courses/isr/opt/classification/Vector_Space_Model.html?timestamp=1318275702299

Thanks for your attention.