CS4485: Information Retrieval

Who I am:
Dr. Lusheng WANG, Dept. of Computer Science
Office: Y6429    Phone: 2788 9820
E-mail: lwang@cs.cityu.edu.hk
Web site: http://www.cs.cityu.edu.hk/~lwang/

Text Book:
• R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
• We will add more material in the handouts.

References:
• W.B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures & Algorithms. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
• I.H. Witten, A. Moffat, and T.C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.
• Michael Lesk. Practical Digital Libraries: Books, Bytes, and Bucks. Morgan Kaufmann, 1997.

Information Retrieval
User task: translate the information need into a query in some language, for example by providing a few words.

Information Retrieval vs. Browsing
Information retrieval: finding useful information for a clearly stated need.
Browsing: the objectives are not clearly defined and may change during the browsing process.
Most systems combine the two.

Logical View of the Documents
Classic view: a set of index terms or keywords.
Full-text logical view: keep the full text (feasible with modern computers).
Either way, some special treatment is still needed (Chapter 7):
• Elimination of stopwords (common words that appear in almost all documents and carry little content)
• Stemming (reducing distinct words to their common grammatical root)
• Identification of noun groups (eliminating adjectives, adverbs, and verbs)
• Compression techniques
Document structure (chapters, sections, subsections) can also be exploited, giving structured text retrieval models.

What We Will Cover
(Syllabus: http://www.cs.cityu.edu.hk//content/courses/index.html)
Retrieval models for text (documents)
Retrieval models for hypertext (searching the web)
Retrieval evaluation
Query languages
Query operations
Text operations
Chinese language text operations
Indexing and searching (algorithmic issues)
Brief introduction to multimedia IR

Evaluation
50% coursework, 50% examination.
Coursework: one assignment 20%, a midterm examination 20%, a project (done in pairs) 60%.

Definitions
A database is a collection of documents.
A document is a sequence of terms, expressing ideas about some topic in a natural language.
A term is a semantic unit: a word, a phrase, or possibly the root of a word.
A query is a request for documents pertaining to some topic.

Definitions (Cont.)
An Information Retrieval (IR) system attempts to find documents relevant to a user's request.
The real problem boils down to matching the language of the query to the language of the documents.

Hard Parts of IR
Simply matching on words is a very brittle approach.
One word can have many different meanings. Consider "take":
"take a place at the table", "take money to the bank", "take a picture", "take a lot of time", "take drugs".

More Problems with IR
You often cannot even tell what part of speech a word has: "I saw her duck."
A query that searches for "pictures of a duck" may find documents that contain "I saw her duck away from the ball falling from the sky".

More Problems with IR
Proper nouns often reuse ordinary nouns.
Consider a document containing "a man named Abraham owned a Lincoln".
A word-matching query for "Abraham Lincoln" may well find the above document.

What Is Different about IR from the Rest of Computer Science
Most algorithms in computer science have a "right" answer. Consider the two problems:
Sort the following ten integers.
Find the highest integer.
Now consider: find the document most relevant to "hippos in the zoo".

Measuring Effectiveness
An algorithm is deemed incorrect if it does not produce the "right" answer.
A heuristic tries to guess something close to the right answer, and heuristics are measured by how close they come to it.
IR techniques are essentially heuristics, because we do not know the right answer. So we have to measure how close to the right answer we can come.

Precision / Recall Example
Consider a query that retrieves 10 documents. Let's say the result set is:
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
If all ten were relevant, we would have 100 percent precision.
If these were the only ten relevant documents in the whole collection, we would also have 100 percent recall.

Example (continued)
Now let's say that only documents D2 and D5 are relevant. Consider the same results:
D1 D2 D3 D4 D5 D6 D7 D8 D9 D10
Since we have retrieved ten documents and got two of them right, precision is 20 percent.
Recall is 2 divided by the total number of relevant documents in the entire collection.

Levels of Recall
If we keep retrieving documents, we will ultimately retrieve all documents and achieve 100 percent recall.
That means we can keep retrieving documents until we reach x% recall.

Levels of Recall (example)
Retrieve the top 2000 documents, and suppose there are five relevant documents in total.

  Number retrieved   DocID   Recall   Precision
  100                A       0.20     0.01
  200                B       0.40     0.01
  500                C       0.60     0.006
  1000               D       0.80     0.004
  1500               E       1.0      0.003

How to Evaluate the Quality of a Retrieval System
Let R be the set of all relevant documents.
A: the set of all documents reported as relevant by the system.
Ra = A ∩ R: the set of relevant documents that are reported.
Recall = |Ra| / |R|.  Recall = 10% means 10% of the relevant documents in R are found.
Precision = |Ra| / |A|.  Precision = 90% means 90% of the reported documents are relevant.
Note:
1. Recall = 100% does not mean that every reported document is relevant.
2. Precision = 100% does not mean that the system has found all relevant documents.
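To make the precision/recall arithmetic above concrete, here is a minimal Python sketch (my own illustration, not part of the lecture) that recomputes the D1..D10 example, assuming D2 and D5 are the only relevant documents in the whole collection:

```python
# Minimal sketch: precision and recall of a retrieved set against the relevant set.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) of the retrieved documents."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = [f"D{i}" for i in range(1, 11)]   # D1 .. D10
relevant = {"D2", "D5"}                       # assume these are all the relevant docs

p, r = precision_recall(retrieved, relevant)
print(p, r)   # 0.2 (2 of 10 retrieved are relevant), 1.0 (both relevant docs were found)
```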
Evaluating IR
Recall is the fraction of all relevant documents in the collection that are retrieved.
Precision is the fraction of the retrieved documents that are relevant.
An IR system ranks documents by a similarity coefficient (SC), allowing the user to trade off between precision and recall.

Precision/Recall Tradeoff
[Figure: precision vs. recall curve, with points marked for the top 10, top 100, and top 1000 retrieved documents; both axes run up to 100%.]

Strategy vs. Utility
An IR strategy is a technique by which a relevance assessment is obtained between a query and a document.
An IR utility is a technique that may be used to improve the assessment given by a strategy. A utility may plug into any strategy.

Strategies
Manual: Boolean
Automatic: Probabilistic, Inference Networks, Vector Space Model, Latent Semantic Indexing (LSI)
Adaptive models: Genetic Algorithms, Neural Networks

Retrieval: Ad Hoc and Filtering
Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system (e.g. a library).
Filtering: the queries remain relatively static while new documents enter and leave the system (e.g. stock market news).

A Formal Characterization of IR Models
Definition: an information retrieval model is a quadruple [D, Q, F, R(q_i, d_j)] where
(1) D is a set of logical views (representations) of the documents in the collection.
(2) Q is a set of logical views (representations) of the user information needs. Such representations are called queries.
(3) F is a framework for modeling document representations, queries, and their relationships.
(4) R(q_i, d_j) is a ranking function which associates a real number with a query q_i ∈ Q and a document representation d_j ∈ D. Such a ranking defines an ordering among the documents with regard to the query q_i.

Index Terms
A document is represented by a set of keywords, called index terms.
How to select keywords is an important issue and will be discussed in Chapter 7.
Some terms are more important than others; e.g. a term that appears in only five documents is more useful than a term that appears in most of the documents.
The word "the" is not useful, while the word "CityU" is important for retrieving information related to our university.

Boolean Model
1. Each document d_j is represented by a vector d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j}), where w_{i,j} = 0 if term k_i does not appear in d_j and w_{i,j} = 1 if term k_i is in d_j.
2. A query is a Boolean formula over the terms, represented in disjunctive normal form; e.g. for three terms, the query k1 ∧ (k2 ∨ ¬k3) corresponds to the conjunctive components (1,1,1), (1,1,0), (1,0,0).

An Example of the Boolean Retrieval Model
Documents:
• d1 = (1, 0, 1, 1, 1, 1, 1, 1), d2 = (0, 1, 0, 0, 1, 1, 1, 1)
• d3 = (0, 0, 0, 1, 1, 1, 1, 1), d4 = (1, 1, 0, 0, 1, 1, 0, 0)
Query (DNF components): (1, 1, 1, 1, 1, 1, 1, 1) and (1, 1, 0, 0, 1, 1, 0, 0).
Result: only d4 is selected, since it matches the second component.

Representation of Documents: Boolean Model
d1: computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
d1 = (1,1,1,0,1,1,0,0,0); d2 = (1,0,1,0,0,0,1,1,0); d3 = (0,0,0,1,1,0,0,0,1).
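The Boolean retrieval just described can be sketched in a few lines of Python. This is only an illustration under my own conventions (a DNF query is a list of conjunctive components, each a list of (term index, required bit) pairs); it is not code from the textbook:

```python
# Minimal sketch of Boolean retrieval with binary document vectors and a DNF query.

def matches(doc, component):
    """True if the document's bits agree with one conjunctive component."""
    return all(doc[i] == bit for i, bit in component)

def boolean_retrieve(docs, dnf_query):
    """Return ids of documents that satisfy at least one component of the DNF."""
    return [name for name, vec in docs.items()
            if any(matches(vec, comp) for comp in dnf_query)]

# Keywords (0-indexed): 0 computer, 1 science, 2 study, 3 store, 4 dept,
#                       5 algorithms, 6 programming, 7 skills, 8 notebook
docs = {
    "d1": (1, 1, 1, 0, 1, 1, 0, 0, 0),
    "d2": (1, 0, 1, 0, 0, 0, 1, 1, 0),
    "d3": (0, 0, 0, 1, 1, 0, 0, 0, 1),
}

# Query "computer AND study AND NOT programming" as a single DNF component.
query = [[(0, 1), (2, 1), (6, 0)]]
print(boolean_retrieve(docs, query))   # ['d1']
```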
Advantages and Disadvantages of the Boolean Model
Advantages:
• simple, easy for users to understand
• precise semantics
• neat formulation
• received great attention in the past
Disadvantages:
• binary decision criterion (a document is either relevant or non-relevant, with no ranking)
• it is hard to formulate the required information as a Boolean formula

Vector Space Model
1. Each document d_j is represented by a vector d_j = (w_{1,j}, w_{2,j}, ..., w_{n,j}), where w_{i,j} ≥ 0.
2. Each query q is also represented by a vector q = (w_{1,q}, w_{2,q}, ..., w_{n,q}).
3. The similarity between the document and the query is defined as

  sim(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\; \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}

Example 1: d_j = (2, 3, 1, 0) and q = (2, 3, 1, 0).
  sim(d_j, q) = (4+9+1+0) / ((4+9+1+0)^{0.5} (4+9+1+0)^{0.5}) = 1.
Example 2: d_j = (0, 0, 0, 5) and q = (2, 3, 1, 0).
  sim(d_j, q) = 0 / ((25)^{0.5} (4+9+1+0)^{0.5}) = 0.
Example 3: d_j = (1, 3, 1, 1) and q = (2, 3, 1, 0).
  sim(d_j, q) = (2+9+1+0) / ((12)^{0.5} (14)^{0.5}) = (0.857)^{0.5} ≈ 0.926.
Example 4: d_j = (1, 3, 1, 0) and q = (2, 3, 1, 0).
  sim(d_j, q) = (2+9+1+0) / ((11)^{0.5} (14)^{0.5}) ≈ 0.967 > 0.926.

Note that w_{i,j} ≥ 0 and w_{i,q} ≥ 0, so sim(d_j, q) lies in [0, 1].
The documents are ranked according to this similarity; even if the match is only partial, the document may still be retrieved.

How to Determine the Weights w_{i,j} on Terms
Definition: Let N be the total number of documents in the system and n_i be the number of documents in which the index term k_i appears. Let freq_{i,j} be the raw frequency of term k_i in document d_j (i.e. the number of times k_i occurs in the text of d_j). Then the normalized frequency f_{i,j} of term k_i in document d_j is given by

  f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}

where the maximum is computed over all terms that occur in the text of d_j. If k_i does not appear in d_j, then f_{i,j} = 0.

Further, let idf_i, the inverse document frequency of k_i, be given by

  idf_i = \log \frac{N}{n_i}

The best-known term-weighting schemes use weights given by

  w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}

Such term-weighting strategies are called tf-idf schemes.

[Figure: idf_i = ln(1000/n_i) plotted as a function of n_i.]
[Figure: idf_i = log(1000/n_i) plotted as a function of n_i.]

Several variations of the above expression for the weight w_{i,j} are described in an interesting 1988 paper by Salton and Buckley. For the query term weights, Salton and Buckley suggest

  w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}

Example 1:
d1: Its term-weighting scheme improves retrieval performance;
d2: Its partial-matching strategy allows retrieval of documents that approximate the query conditions;
d3: Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
In this example, N = 3. For the term k_i = "retrieval", n_i = 2, idf_i = log(3/2) = 0.176, freq_{i,1} = 1, f_{i,1} = 1, and w_{i,1} = 0.176.
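As a rough illustration of the tf-idf weighting and cosine ranking above, here is a small Python sketch (my own code with invented names, not the book's; it uses base-10 logarithms to match the 0.176 value in Example 1):

```python
# Minimal sketch: tf-idf weights and cosine similarity for a toy collection.

import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists.  Returns one {term: w_ij} map per document."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))   # n_i per term
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values()) if freq else 1
        weights.append({t: (f / max_freq) * math.log10(N / df[t])
                        for t, f in freq.items()})
    return weights

def cosine(w_doc, w_query):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * w_query.get(t, 0.0) for t, w in w_doc.items())
    norm_d = math.sqrt(sum(w * w for w in w_doc.values()))
    norm_q = math.sqrt(sum(w * w for w in w_query.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

docs = [
    ["computer", "science", "dept", "computer", "study", "computer", "algorithms"],
    ["computer", "study", "programming", "skills"],
    ["dept", "store", "notebook"],
]
weights = tf_idf_weights(docs)
query_w = {"computer": 1.0, "algorithms": 1.0}   # toy query weights
for i, w in enumerate(weights, 1):
    print(f"d{i}", round(cosine(w, query_w), 3))
```

Running this ranks d1 above d2 and d3 for the toy query, as expected from the weights computed in the tables that follow.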
Example 2:
d1: computer science department, computer study, computer algorithms
d2: computer study, programming skills
d3: department stores, notebook
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
Raw frequency vectors: d1 = (3,1,1,0,1,1,0,0,0); d2 = (1,0,1,0,0,0,1,1,0); d3 = (0,0,0,1,1,0,0,0,1).

Raw frequencies freq_{i,j}:

  d_j \ k_i   k1   k2   k3   k4   k5   k6   k7   k8   k9
  d1          3    1    1    0    1    1    0    0    0
  d2          1    0    1    0    0    0    1    1    0
  d3          0    0    0    1    1    0    0    0    1

Normalized frequencies f_{i,j}:

  d_j \ k_i   k1    k2    k3    k4   k5    k6    k7   k8   k9
  d1          1     0.33  0.33  0    0.33  0.33  0    0    0
  d2          1     0     1     0    0     0     1    1    0
  d3          0     0     0     1    1     0     0    0    1

Document frequencies and inverse document frequencies (N = 3):

  i      1     2     3     4     5     6     7     8     9
  n_i    2     1     2     1     2     1     1     1     1
  idf_i  0.18  0.48  0.18  0.48  0.18  0.48  0.48  0.48  0.48

Weights w_{i,j} = f_{i,j} × idf_i:

  d_j \ k_i   k1    k2    k3    k4    k5    k6    k7    k8    k9
  d1          0.18  0.16  0.06  0     0.06  0.16  0     0     0
  d2          0.18  0     0.18  0     0     0     0.48  0.48  0
  d3          0     0     0     0.48  0.18  0     0     0     0.48

Example 3:
d1: Computer science dept. Algorithms improve retrieval performance
d2: computer study, algorithm, programming skills, query conditions
d3: computer stores, notebook, printers
d4: computer store sales CDs and software
Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. improve, 8. retrieval, 9. performance, 10. programming, 11. skills, 12. query, 13. conditions, 14. notebook, 15. printers, 16. sales, 17. CDs, 18. software, 19. algorithm
Questions (these are language-processing issues):
• 19 keywords or 18? Should "algorithm" and "algorithms" be treated as the same index term?
• Every document may contain "the"; do we need it as an index term?
• "Table" and "desk": are they the same? Related?
(A small sketch at the end of these notes illustrates these preprocessing choices.)

Summary
Information retrieval models:
• Boolean model
• Vector space model

Course Arrangement
No lecture or tutorial in week 2. A make-up class will be scheduled in week 3.
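Finally, a minimal sketch of the keyword-selection issues raised in Example 3: stopword removal and a deliberately naive suffix-stripping "stemmer" that would merge "algorithm" and "algorithms" into one index term, giving 18 keywords instead of 19. The stopword list and the stemming rule are my own placeholders; a real system would use a curated stopword list and a proper stemmer such as Porter's (Chapter 7):

```python
# Minimal sketch: stopword removal and naive stemming when selecting index terms.

STOPWORDS = {"the", "a", "an", "and", "of", "to", "its", "that"}

def naive_stem(word):
    """Very crude plural stripping, for illustration only."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def index_terms(text):
    """Lowercase, drop stopwords, stem, and return the set of index terms."""
    tokens = [w.strip(".,'\u2019") for w in text.lower().split()]
    return {naive_stem(w) for w in tokens if w and w not in STOPWORDS}

d1 = "Computer science dept. Algorithms improve retrieval performance"
d2 = "computer study, algorithm, programming skills, query conditions"
print(index_terms(d1) & index_terms(d2))   # {'computer', 'algorithm'}
```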