INT309 Multimedia Information Retrieval and Technology

Question 1 (20 marks)

1. Explain the effect of skip pointers. What are the implications of short skip spans and long skip spans? (5)
2. In a Boolean retrieval system, how does stemming affect the precision and recall? (5)
3. What is relevance feedback? (5)
4. When coding an audio signal, what is the advantage of forming differences rather than using the original signal itself? (5)

Question 2 (25 marks)

Explore the vector space model on information retrieval tasks. Consider a simple collection with the following two documents only:

Document 1: the way to the school, walking in the rain
Document 2: the rain school closed

Also, consider the following stop word list: the, to, in

1. Draw the inverted index representation for this collection. (4)
2. Consider Boolean retrieval, an intersection algorithm INTERSECT(p1, p2) is listed in Fig. Q2.2 for two postings lists p1 and p2 when the query is “walk AND rain”. Design an algorithm for the query “walk OR rain”. (5)

   [Fig. Q2.2: the INTERSECT(p1, p2) algorithm]

3. Consider the query: (6)

   Query: school closed rain

   Compute and report the cosine similarity of the query with each document when we use the stop words shown above. The query is converted to a unit vector using tf-idf weighting $w_{t,d} = tf_{t,d} \times \log_{10}(N/df_t)$ and Euclidean normalization, where $tf_{t,d}$ is the raw term frequency. Each document is simply represented as a term frequency (tf) vector which is normalized using the maximum tf formula:

   $0.25 + [0.75 \times tf_{t,d} / \max(tf_{t,d})]$

4. How does the base of the logarithm in $idf_t = \log_{10}(N/df_t)$ affect the relative scores (rankings) of two documents on a given query? (4)
5. How could we answer phrase queries like “school closed”? (4)
6. Using Rocchio relevance feedback for query optimization, for the modified query vector, why do we need to set negative term weights back to 0?
(2)

Question 3 (30 marks)

Suppose we have a collection of 20 documents, d1, d2, ..., d20, which have been judged for relevance to a query. A 3-point relevance scale was used, so relevant documents have been divided into Perfect, Good and just Relevant results. Weights for these levels are shown below for NDCG (Normalized Discounted Cumulative Gain):

Perfect   Good   Relevant   Non-relevant
3         2      1          0

Consider the result lists retrieved for three different information needs, shown below:

Result Q1 = <3, 0, 2, 2, 0>
Result Q2 = <3, 2, 2, 2, 0, 2, 0, 1>
Result Q3 = <0, 2, 0, 3>

1. Assume there are 10 relevant documents in total in the collection. What are the precision and recall for result list Result Q2? Draw the interpolated precision-recall curve. (4)
2. What is the precision@4 of each result list? (4)
3. What is the average precision of each result list? What is the MAP for this IR system if there are only these three information needs in the test collection? (4)
4. What is the perfect ranking for Result Q2 = <3, 2, 2, 2, 0, 2, 0, 1>? Calculate the Ideal Discounted Cumulative Gain (DCG) for this set of documents. (7)
5. To measure/evaluate information retrieval (IR) effectiveness, what are the three elements required for a test collection, so that the performance of IR systems can be compared? (3)
6. What is the kappa measure? (3)
7. For a particular information need, Judge 1 rated the relevance of a set of 5 documents as Result1 = <R, N, R, R, N> and Judge 2 rated them as Result2 = <R, R, R, N, N>. Calculate the kappa measure if the expected chance agreement ratio P(E) is 0.5.
(5)

Question 4 (25 marks)

Consider the following supervised corpus of news headlines, where the document class is in bold (not considered as a part of the document):

[World News] “Iraq election”, “executive injured”
[Business] “executive smiles”, “executive suite”

Using this corpus, we will try to predict the class of the document “executive suite”.

1. By using the Rocchio classification algorithm, compute the centroid of each class. Express the centroids of each class and the query as raw term frequency vectors (normalized). Determine the class of the document. (10)
2. Using the method of maximum likelihood estimation, evaluate: (15)
   • $\hat{P}_{MLE}(\text{World News})$
   • $\hat{P}_{MLE}(\text{Business})$
   • $\hat{P}_{MLE}(\text{executive} \mid \text{World News})$
   • $\hat{P}_{MLE}(\text{executive} \mid \text{Business})$
   • $\hat{P}_{MLE}(\text{suite} \mid \text{World News})$
   • $\hat{P}_{MLE}(\text{suite} \mid \text{Business})$
   Then determine which class this document is assigned to by a multinomial Naive Bayes classifier.

——— End of paper ———

Appendix A: Equation List

The entropy:
\[ \eta = \sum_i p_i \log_2 \frac{1}{p_i} \tag{1} \]

Multinomial:
\[ P(c \mid d) \propto P(c) \prod_{1 \le k \le N_d} P(t_k \mid c) \tag{2} \]
\[ P(t_k \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} \tag{3} \]

Rocchio relevance feedback:
\[ \vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \tag{4} \]

Bernoulli:
\[ P(c \mid d) \propto P(c) \prod_{t_k \in Q} P(t_k \mid c) \prod_{t_k \notin Q} [1 - P(t_k \mid c)] \tag{5} \]
\[ P(t_k \mid c) = \frac{df_{ct} + 1}{N_c + Number_{classes}} \tag{6} \]

Arithmetic Coding Encoder:
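As an illustration of how equations (1) to (3) of the equation list can be evaluated, the following is a minimal Python sketch. It is not part of the paper's required method. The function names (`entropy`, `smoothed_term_prob`, `classify`) and the toy "sports"/"tech" data are hypothetical, chosen only for the example, and are unrelated to the corpus in Question 4.

```python
import math
from collections import Counter


def entropy(probs):
    """Equation (1): sum of p_i * log2(1 / p_i); zero-probability symbols are skipped."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


def smoothed_term_prob(term, class_tokens, vocabulary):
    """Equation (3): add-one smoothed estimate of P(term | class).

    class_tokens is the list of all tokens seen in the class (hypothetical input format).
    """
    counts = Counter(class_tokens)  # a missing term counts as 0
    denominator = sum(counts[t] + 1 for t in vocabulary)
    return (counts[term] + 1) / denominator


def classify(doc_tokens, tokens_by_class, priors, vocabulary):
    """Equation (2), evaluated in log space to avoid underflow; returns the best class."""
    best_class, best_score = None, -math.inf
    for c, class_tokens in tokens_by_class.items():
        score = math.log(priors[c])
        for t in doc_tokens:
            score += math.log(smoothed_term_prob(t, class_tokens, vocabulary))
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data, assumed purely for illustration.
TOKENS_BY_CLASS = {"sports": ["ball", "goal", "ball"], "tech": ["chip", "code"]}
VOCABULARY = {"ball", "goal", "chip", "code"}
PRIORS = {"sports": 0.5, "tech": 0.5}
```

With this toy data, `smoothed_term_prob("ball", TOKENS_BY_CLASS["sports"], VOCABULARY)` evaluates (2 + 1) / (3 + 4), and `classify(["ball", "ball"], TOKENS_BY_CLASS, PRIORS, VOCABULARY)` picks the class with the larger log score. Working in log space is a common design choice because the product in equation (2) shrinks quickly for long documents.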