CS 430 Information Discovery
Midterm Examination
Wednesday, October 31, 2001
7:30 to 9:00 p.m.

Instructions

1) Answer all questions.
2) Write your answers in an examination book. WRITE YOUR NETID ON THE FRONT OF EACH BOOK.
3) This is an open book examination.

Question 1

The course looked at the following options to use in calculating the similarity between two documents:

    (i) Inner (dot) product with no weighting.
    (ii) Cosine (dividing by product of lengths) to normalize for vectors of different lengths.
    (iii) Term weighting using frequency of terms within the document.
    (iv) Term weighting using an inverse function of terms in the entire collection (e.g., Inverse Document Frequency).

(a) The aggregate term weighting is sometimes written: w = tf * idf, which combines options (iii) and (iv). Explain the purpose of each of options (iii) and (iv) in calculating term weights.

(b) How would you modify the term weighting to account for documents that differ greatly in length?

(c) Consider the query:

    Q:  bee cat cat elk

and the two following documents:

    D1: ant bee cat dog ant dog ant fox gnu
    D2: ant cat cat elk elk dog dog ant

What is the similarity between this query and each of the documents, without term weighting? How would you rank the documents?

Question 2

(a) Why are precision and recall difficult measures to use for retrieval effectiveness when there is a user in the loop?

(b) A common complaint about web search systems is that a simple search returns thousands of hits.

    (i) Why does this happen?
    (ii) Is it necessarily bad?

(c) The Google ranking algorithm can be written:

\[ P(A) = (1 - d) + d \sum_{i=1}^{n} \frac{P(T_i)}{C(T_i)} \]

where page A has pages T_1, ..., T_n pointing to it, C(A) is the number of links out of A, and d is a damping factor.

    (i) Explain the concept behind this algorithm.
    (ii) How is this algorithm used to address the problems discussed in part (b)?

Question 3

(a) We have examined two general strategies for information retrieval.
The first is automatic indexing of full text. The second is to create a catalog record and to carry out fielded searching of the catalog record.

    (i) Give three advantages of the first strategy.
    (ii) Give three advantages of the second strategy.

(b) Here is a Dublin Core metadata record for the web site http://www.georgewbush.com/. This record was generated automatically by a computer program, with no manual input.

    Subject: George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government
    Description: George W. Bush is running for President of the United States to keep the country prosperous.
    Publisher: Concentric Network Corporation
    Date: 2001-01-12
    Type: Text
    Format: text/html
    Format: 12223 bytes
    Identifier: http://www.georgewbush.com/

    (i) For each of these eight fields, is the metadata that was automatically generated consistent with the Dublin Core definitions?
    (ii) The Publisher field has the unexpected value, "Concentric Network Corporation", which appears nowhere on the web site. Where does this value come from?
    (iii) How could a program that generates metadata automatically construct a Title field for a web site? Why does this record not have a title field?
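
Worked sketch for Question 1

The inner product and cosine options (i) and (ii) of Question 1 can be sketched in a few lines of Python, here applied to the query and documents of part (c). This is only an illustration under stated assumptions: the question does not say whether "without term weighting" means binary term incidence or raw term counts, so the `binary` flag covers both readings, and the function names are hypothetical rather than taken from the course materials.

```python
import math
from collections import Counter


def term_vector(text, binary=False):
    """Map each term to its raw count, or to 1 when binary=True."""
    counts = Counter(text.split())
    if binary:
        return {term: 1 for term in counts}
    return dict(counts)


def inner_product(u, v):
    """Option (i): sum of products over the terms both vectors contain."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def cosine(u, v):
    """Option (ii): inner product divided by the product of vector lengths."""
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return inner_product(u, v) / (norm_u * norm_v)


# Query and documents copied from Question 1(c).
QUERY = "bee cat cat elk"
DOCS = {
    "D1": "ant bee cat dog ant dog ant fox gnu",
    "D2": "ant cat cat elk elk dog dog ant",
}


def similarities(binary=False):
    """Return {document name: (inner product, cosine)} against QUERY."""
    q = term_vector(QUERY, binary)
    result = {}
    for name, text in DOCS.items():
        d = term_vector(text, binary)
        result[name] = (inner_product(q, d), cosine(q, d))
    return result


if __name__ == "__main__":
    for binary in (True, False):
        label = "binary" if binary else "raw counts"
        for name, (dot, cos) in similarities(binary).items():
            print(f"{label:10s} {name}: inner={dot} cosine={cos:.4f}")
```

Running the script prints both similarity measures under each reading, which makes it easy to see how the choice of vector representation and of length normalization can change the ranking of D1 and D2.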