CS 430 Information Discovery
Midterm Examination
Wednesday, October 31, 2001
7:30 to 9:00 p.m.
Instructions
1) Answer all questions.
2) Write your answers in an examination book. WRITE YOUR NETID ON THE
FRONT OF EACH BOOK.
3) This is an open book examination.
Question 1
The course looked at the following options to use in calculating the similarity between
two documents:
(i) Inner (dot) product with no weighting.
(ii) Cosine (dividing by the product of the lengths) to normalize for vectors of
different lengths.
(iii) Term weighting using frequency of terms within the document.
(iv) Term weighting using an inverse function of the frequency of terms in the entire
collection (e.g., Inverse Document Frequency).
(a) The aggregate term weighting is sometimes written: w = tf * idf, which combines
options (iii) and (iv).
Explain the purpose of each of option (iii) and option (iv) in calculating term weights.
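[For reference, a minimal sketch in Python of how options (iii) and (iv) combine in
w = tf * idf. The representation of documents as term lists and the use of a natural
logarithm are assumptions for illustration, not part of the question.]

import math

def tf_idf_weights(doc_terms, collection):
    # w = tf * idf for each distinct term in one document.
    # doc_terms: list of terms in the document; collection: list of such lists.
    n_docs = len(collection)
    weights = {}
    for term in set(doc_terms):
        tf = doc_terms.count(term)                    # option (iii): frequency within the document
        df = sum(1 for d in collection if term in d)  # number of documents containing the term
        idf = math.log(n_docs / df)                   # option (iv): inverse collection-wide frequency
        weights[term] = tf * idf
    return weights

docs = ["ant bee cat dog ant dog ant fox gnu".split(),
        "ant cat cat elk elk dog dog ant".split()]
print(tf_idf_weights(docs[0], docs))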
(b) How would you modify the term weighting to account for documents that differ
greatly in length?
(c) Consider the query:
Q: bee cat cat elk
and the two following documents:
D1: ant bee cat dog ant dog ant fox gnu
D2: ant cat cat elk elk dog dog ant
What is the similarity between this query and each of the documents, without term
weighting? How would you rank the documents?
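[For part (c), a minimal sketch of the unweighted inner product, assuming "no
weighting" is read as binary (0/1) term incidence; counting repeated terms instead
would change the scores.]

Q  = "bee cat cat elk"
D1 = "ant bee cat dog ant dog ant fox gnu"
D2 = "ant cat cat elk elk dog dog ant"

def binary_inner_product(query, doc):
    # Each distinct term contributes 0 or 1; the dot product is the size of the overlap.
    return len(set(query.split()) & set(doc.split()))

print(binary_inner_product(Q, D1))
print(binary_inner_product(Q, D2))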
Question 2
(a) Why are precision and recall difficult measures to use for retrieval effectiveness
when there is a user in the loop?
(b) A common complaint about web search systems is that a simple search returns
thousands of hits.
(i) Why does this happen?
(ii) Is it necessarily bad?
(c) The Google ranking algorithm can be written:
 n PTi  

P A  1  d   d  


C
T
i 
 i 1
where page A has pages T1, ..., Tn pointing to it, C(T) is the number of links out of
page T, and d is a damping factor.
(i) Explain the concept behind this algorithm.
(ii) How is this algorithm used to address the problems discussed in part (b)?
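[A minimal sketch of the iteration this formula describes, using a hand-built
three-page link graph and the conventional damping factor d = 0.85; both are
assumptions for illustration.]

def pagerank(out_links, d=0.85, iterations=50):
    # Repeatedly apply P(A) = (1 - d) + d * sum over pages Ti linking to A of P(Ti)/C(Ti).
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            a: (1 - d) + d * sum(rank[t] / len(out_links[t])
                                 for t in pages if a in out_links[t])
            for a in pages
        }
    return rank

# Hypothetical web: A and B link to each other, C links only to A.
links = {"A": {"B"}, "B": {"A"}, "C": {"A"}}
print(pagerank(links))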
Question 3
(a) We have examined two general strategies for information retrieval. The first is
automatic indexing of full text. The second is to create a catalog record and to carry
out fielded searching of the catalog record.
(i) Give three advantages of the first strategy.
(ii) Give three advantages of the second strategy.
(b) Here is a Dublin Core metadata record for the web site
http://www.georgewbush.com/. This record was generated automatically by a
computer program, with no manual input.
Subject:
George W. Bush; Bush; George Bush; President; republican;
2000 election; election; presidential election; George; B2K;
Bush for President; Junior; Texas; Governor; taxes; technology;
education; agriculture; health care; environment; society; social
security; medicare; income tax; foreign policy; defense;
government
Description:
George W. Bush is running for President of the United States to
keep the country prosperous.
Publisher:
Concentric Network Corporation
Date:
2001-01-12
Type:
Text
Format:
text/html
Format:
12223 bytes
Identifier:
http://www.georgewbush.com/
(i) For each of these eight fields, is the metadata that was automatically
generated consistent with the Dublin Core definitions?
(ii) The Publisher field has the unexpected value, "Concentric Network
Corporation", which appears nowhere on the web site. Where does this value
come from?
(iii) How could a program that generates metadata automatically construct a Title
field for a web site? Why does this record not have a Title field?
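[On part (iii), one plausible approach is sketched below: take the text of the HTML
<title> element as the Title field. When a page has no <title> element, or an empty
one, the generator has nothing reliable to emit, which is one way a record ends up
without a Title field. The class and function names here are illustrative.]

from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    # Collects the text of the first <title> element, if any.
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title = (self.title or "") + data

def make_title_field(html_text):
    parser = TitleExtractor()
    parser.feed(html_text)
    return parser.title.strip() if parser.title else None

print(make_title_field("<html><head><title>Example Page</title></head></html>"))
print(make_title_field("<html><body><h1>No title element</h1></body></html>"))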