INT309 Practice II-2021

Question 1 (20 marks)
1. Explain the effect of skip pointers. What are the implications of short skip spans and
long skip spans?
(5)
2. In a Boolean retrieval system, how does stemming affect the precision and recall?
(5)
3. What is relevance feedback?
(5)
4. When coding an audio signal, what is the advantage of forming differences rather than
using the original signal itself?
(5)
INT309
Multimedia Information Retrieval and Technology
Question 2 (25 marks)
Explore the vector space model on information retrieval tasks. Consider a simple collection with the following two documents only:
Document 1: the way to the school, walking in the rain
Document 2: the rain school closed
Also, consider the following stop word list:
the, to, in
1. Draw the inverted index representation for this collection.
(4)
2. Consider Boolean retrieval. An intersection algorithm INTERSECT(p1, p2) is listed in
Fig. Q2.2 for two postings lists p1 and p2 when the query is “walk AND rain”. Design
an algorithm for the query “walk OR rain”.
(5)
Fig. Q2.2
3. Consider the query:
Query: school closed rain
Compute and report the cosine similarity of the query with each document when we
use the stop words shown above. The query is converted to a unit vector using tf-idf
weighting $w_{t,d} = \mathrm{tf}_{t,d} \times \log_{10}(N/\mathrm{df}_t)$ and Euclidean normalization, where $\mathrm{tf}_{t,d}$ is the raw term frequency. Each document is simply represented as a term frequency (tf) vector which is
normalized using the maximum tf formula:
$0.25 + [0.75 \times \mathrm{tf}_{t,d} / \max(\mathrm{tf}_{t,d})]$
(6)
4. How does the base of the logarithm in $\mathrm{idf}_t = \log_{10}(N/\mathrm{df}_t)$ affect the relative scores
(rankings) of two documents on a given query?
(4)
5. How could we answer phrase queries like “school closed”?
(4)
6. Using Rocchio relevance feedback for query optimization, for the modified query vector,
why do we need to set negative term weights back to 0?
(2)
Question 3 (30 marks)
Suppose we have a collection of 20 documents, d1, d2, ..., d20, which have been judged for
relevance to a query. A 3-point relevance scale was used, so relevant documents have been
divided into Perfect, Good and just Relevant results. Weights for these levels are shown below
for NDCG (Normalized Discounted Cumulative Gain):
Perfect   Good   Relevant   Non-relevant
3         2      1          0
Consider the result lists retrieved for the three different information needs shown as below
respectively:
Result Q1 = <3, 0, 2, 2, 0>
Result Q2 = <3, 2, 2, 2, 0, 2, 0, 1>
Result Q3 = <0, 2, 0, 3>
1. Assume there are a total of 10 relevant documents in the collection. What are the precision
and recall for result list Result Q2? Draw the interpolated precision-recall curve.
(4)
2. What is the precision @4 of each result list?
(4)
3. What is the average precision of each result list, and what is the MAP for this IR system
if there are only these three information needs in the test collection?
(4)
4. What is the perfect ranking for Result Q2 = <3, 2, 2, 2, 0, 2, 0, 1>? Calculate the
Ideal Discounted Cumulative Gain (IDCG) for this set of documents.
(7)
5. To measure/evaluate information retrieval (IR) effectiveness, what are the three elements
required for a test collection, so the performance of the IR system could be compared?
(3)
6. What is the kappa measure?
(3)
7. For a particular information need, Judge 1 rated the relevance of a set of 5 documents
as Result1 = <R, N, R, R, N> and Judge 2 rated them as Result2 = <R, R, R, N, N>.
Calculate the kappa measure if the expected chance agreement ratio P(E) is 0.5.
(5)
Question 4 (25 marks)
Consider the following supervised corpus of news headlines, where the document class is in
bold (not considered as a part of the document):
[World News] “Iraq election”, “executive injured”
[Business] “executive smiles”, “executive suite”
Using this corpus, we will try to predict the class of the document “executive suite”.
1. Using the Rocchio classification algorithm, compute the centroid of each class. Express
the centroids of each class and the query as raw term frequency vectors (normalized).
Determine the class of the document.
(10)
2. Using the method of maximum likelihood estimation, evaluate:
(15)
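• $\hat{P}_{\mathrm{MLE}}(\text{World News})$
• $\hat{P}_{\mathrm{MLE}}(\text{Business})$
• $\hat{P}_{\mathrm{MLE}}(\text{executive} \mid \text{World News})$
• $\hat{P}_{\mathrm{MLE}}(\text{executive} \mid \text{Business})$
• $\hat{P}_{\mathrm{MLE}}(\text{suite} \mid \text{World News})$
• $\hat{P}_{\mathrm{MLE}}(\text{suite} \mid \text{Business})$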
Then determine which class this document is assigned to by a multinomial Naive Bayes
classifier.
——— End of paper ———
Appendix A: Equation List
The entropy:
$$\eta = \sum_{i} p_i \log_2 \frac{1}{p_i} \qquad (1)$$
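As a quick numerical check of the entropy formula, here is a minimal Python sketch. The function name `entropy` and the two example distributions are illustrative assumptions and do not come from this paper; terms with p_i = 0 are skipped, following the usual convention that they contribute nothing.

```python
import math


def entropy(probabilities):
    """Entropy in bits: sum over i of p_i * log2(1 / p_i).

    Zero-probability symbols are skipped (assumed to contribute 0).
    """
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)


# Hypothetical distributions, chosen only for illustration.
uniform4 = entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely symbols
skewed = entropy([0.5, 0.25, 0.25])           # one symbol twice as likely
print(uniform4, skewed)
```

A uniform distribution over four symbols gives 2 bits, and the skewed three-symbol distribution gives 1.5 bits.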
Multinomial:
$$P(c|d) \propto P(c) \prod_{1 \le k \le N_d} P(t_k|c) \qquad (2)$$
$$P(t_k|c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} \qquad (3)$$
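The two multinomial equations can be exercised with a short Python sketch. It assumes a small hypothetical training set (deliberately not the corpus from Question 4); the helper names `train_multinomial_nb`, `cond_prob` and `log_score` are illustrative. Scores are summed in log space, which preserves the ranking implied by the product in equation (2).

```python
from collections import Counter
import math


def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (class_label, list_of_tokens) pairs."""
    class_doc_counts = Counter(label for label, _ in labeled_docs)
    term_counts = {label: Counter() for label in class_doc_counts}
    for label, tokens in labeled_docs:
        term_counts[label].update(tokens)  # T_ct: occurrences of t in class c
    vocabulary = {t for _, tokens in labeled_docs for t in tokens}
    # Prior P(c) estimated as the fraction of training documents in class c.
    priors = {c: n / len(labeled_docs) for c, n in class_doc_counts.items()}
    return priors, term_counts, vocabulary


def cond_prob(term, label, term_counts, vocabulary):
    """Add-one smoothed P(t|c) = (T_ct + 1) / sum over t' in V of (T_ct' + 1)."""
    counts = term_counts[label]
    return (counts[term] + 1) / (sum(counts.values()) + len(vocabulary))


def log_score(tokens, label, priors, term_counts, vocabulary):
    """log P(c) + sum of log P(t_k|c); out-of-vocabulary tokens are ignored."""
    score = math.log(priors[label])
    for t in tokens:
        if t in vocabulary:
            score += math.log(cond_prob(t, label, term_counts, vocabulary))
    return score


# Hypothetical toy training data, for illustration only.
training = [
    ("china", ["chinese", "beijing", "chinese"]),
    ("china", ["chinese", "chinese", "shanghai"]),
    ("china", ["chinese", "macao"]),
    ("other", ["tokyo", "japan", "chinese"]),
]
priors, term_counts, vocabulary = train_multinomial_nb(training)
test_doc = ["chinese", "chinese", "chinese", "tokyo", "japan"]
predicted = max(
    priors, key=lambda c: log_score(test_doc, c, priors, term_counts, vocabulary)
)
print(predicted)
```

Because each denominator term is T_ct' + 1, the denominator equals the total token count of the class plus the vocabulary size, which is how `cond_prob` computes it.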
Rocchio relevance feedback:
$$\vec{q}_m = \alpha \vec{q}_0 + \beta \frac{1}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j + \gamma \frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j \qquad (4)$$
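A minimal Python sketch of the Rocchio update follows. It implements the formula as printed in this list, with the non-relevant centroid added with weight γ, so moving the query away from non-relevant documents corresponds to a negative γ here (many presentations instead write the term as −γ with γ ≥ 0). That sign convention, the function names, the toy vectors and the final clipping of negative weights to 0 (the step referred to in Question 2) are assumptions of this sketch.

```python
def centroid(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def rocchio(q0, relevant, nonrelevant, alpha, beta, gamma):
    """q_m = alpha*q0 + beta*centroid(D_r) + gamma*centroid(D_nr), then clip at 0."""
    dim = len(q0)
    c_r = centroid(relevant, dim)
    c_nr = centroid(nonrelevant, dim)
    q_m = [alpha * q0[i] + beta * c_r[i] + gamma * c_nr[i] for i in range(dim)]
    # Negative term weights are set back to 0 in the modified query.
    return [max(0.0, w) for w in q_m]


# Hypothetical three-term vocabulary and judged documents, for illustration only.
q0 = [1.0, 0.0, 1.0]
relevant = [[2.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
nonrelevant = [[0.0, 4.0, 0.0]]
modified = rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=-0.25)
print(modified)
```

Before clipping, the second component is 0 + 0 + (−0.25 × 4) = −1, so the clipped result is [1.5, 0.0, 1.5].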
Bernoulli:
$$P(c|d) \propto P(c) \prod_{t_k \in Q} P(t_k|c) \prod_{t_k \notin Q} [1 - P(t_k|c)] \qquad (5)$$
$$P(t_k|c) = \frac{\mathrm{df}_{ct} + 1}{N_c + \mathrm{Number}_{\mathrm{classes}}} \qquad (6)$$
Arithmetic Coding Encoder:
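A minimal sketch of an arithmetic coding encoder, assuming the usual interval-narrowing formulation: start from [0, 1) and, for each symbol, shrink the current interval to that symbol's cumulative-probability sub-range. The function name `arithmetic_encode`, the `ranges` table and the three-symbol alphabet are illustrative assumptions and are not taken from this paper; exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def arithmetic_encode(symbols, ranges):
    """Return the final [low, high) interval after encoding `symbols`.

    ranges maps each symbol to (range_low, range_high), its slice of [0, 1).
    """
    low, high = Fraction(0), Fraction(1)
    for s in symbols:
        width = high - low
        r_low, r_high = ranges[s]
        # Compute the new bounds from the old `low` before overwriting it.
        high = low + width * r_high
        low = low + width * r_low
    return low, high


# Hypothetical alphabet: P(A) = 1/2, P(B) = 1/4, P(C) = 1/4.
ranges = {
    "A": (Fraction(0), Fraction(1, 2)),
    "B": (Fraction(1, 2), Fraction(3, 4)),
    "C": (Fraction(3, 4), Fraction(1)),
}
interval = arithmetic_encode("AB", ranges)
print(interval)
```

Encoding "AB" narrows [0, 1) first to [0, 1/2) and then to [1/4, 3/8); any number inside the final interval identifies the message, given its length or a terminator symbol.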