IR Homework #2
By J. H. Wang
Mar. 31, 2015

Programming Exercise #2: Query Processing and Searching
• Goal: to search for documents relevant to a given query
• Input: a query (and the inverted index)
  – (simple search: keyword, Boolean)
• Output: a ranked list of search results from the ClueWeb09 collection
  – (details to be described later)

Input: User Query and Inverted Index
• Simple queries
  – Single keywords
    • Ex: Microsoft, airplanes, …
  – Free text
    • Ex: United States, non-profit organization, …
  – Simple Boolean search
    • Ex: open source AND Linux, software engineer OR project manager, …
• Inverted index
  – As generated in HW#1

Output: Ranked Search Results
• A ranked list of search results from the ClueWeb09 collection
  – Ranking: vector space model
    • Term weighting scheme: TF-IDF
    • Similarity estimation: cosine similarity between query and document vectors

$w_{i,j} = (1 + \log tf_{i,j}) \cdot \log(N / df_i)$

$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$

$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$

where $tf_{i,j}$ is the frequency of term $i$ in document $j$, $df_i$ is the number of documents containing term $i$, $N$ is the total number of documents, and $t$ is the number of distinct terms.

Example Output
• Ex:
  – Query: “Hong Kong”
  – Result: <doc#> <similarity score>
    • 261 0.85
      135 0.67
      324 0.3
      …

Optional Features
• Optional functionalities
  – Better user interface for search
  – Complex queries: phrase, wildcard, substring, proximity search, combinations of Boolean operators, … (Ch.2 & 3)
  – Query processing: spelling correction, phonetic correction, … (Ch.3)
  – Different term weighting schemes: variants of TF-IDF, … (Ch.6)
  – Inexact top-k retrieval: index elimination, champion lists, impact ordering, tiered indexes, … (Ch.7)
  – Each optional feature should be able to be turned on/off by a parameter

Submission
• Your submission *should* include
  – The source code (and the configuration of any extra libraries)
    • If you use open source tools, please also submit your source code that calls their APIs or libraries
  – A one-page documentation including
    • Major features: ex: high efficiency, low storage, multiple input formats, huge corpus, …
    • Major difficulties encountered
    • Special requirements for execution environments (ex: Java Runtime Environment, special compilers, …)
    • Team members list: the name of each member and the parts he or she was responsible for should be clearly identified
• Due: three weeks (Apr. 27, 2015)

Submission Instructions
• Programs and related electronic files in your homework must be submitted directly on the submission site:
  – Submission site: http://140.124.183.13/
  – Preparing your submission file: as one single compressed file
    • Name your file according to your ID, such as <ID>_HW2.zip.
    • Remember to specify the names and student IDs of your team members in the files and documentation
  – If you cannot successfully submit your work, please contact the TA (@ R1424, Technology Building)

Evaluation
• Minimum requirement: correctness for simple queries
  – Some example queries from the ClueWeb09 Test Collection will be submitted to your program, and the ranked list will be checked for effectiveness
• Optional features will be considered as bonus
  – Various query types, weighting schemes, efficient scoring and ranking, …
• You might be required to give a demo if the TA is unable to run the submitted program

Any Questions or Comments?
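
The ranking described under "Output: Ranked Search Results" (TF-IDF weights with cosine similarity) could be sketched roughly as follows. This is only a minimal illustration, not a reference solution. It assumes the HW#1 inverted index can be loaded as a dictionary mapping each term to `{doc_id: term_frequency}`, that queries are tokenized by lowercasing and splitting on whitespace, and that `log` is the natural logarithm. The names `tfidf_weight`, `document_norms`, `rank`, and the toy index are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_weight(tf, df, n_docs):
    """w = (1 + log tf) * log(N / df); 0 when the term is absent."""
    if tf <= 0 or df <= 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


def document_norms(index, n_docs):
    """Precompute |d_j| for every document from the whole index."""
    squared = defaultdict(float)
    for postings in index.values():
        df = len(postings)
        for doc_id, tf in postings.items():
            squared[doc_id] += tfidf_weight(tf, df, n_docs) ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in squared.items()}


def rank(query, index, n_docs, norms):
    """Return [(doc_id, cosine similarity)] sorted by descending score."""
    # Assumed tokenization: lowercase, split on whitespace.
    q_tf = Counter(query.lower().split())
    q_weights = {
        term: tfidf_weight(tf, len(index[term]), n_docs)
        for term, tf in q_tf.items()
        if term in index
    }
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    if q_norm == 0.0:
        return []

    # Accumulate the dot product term by term over the postings lists.
    dots = defaultdict(float)
    for term, w_q in q_weights.items():
        df = len(index[term])
        for doc_id, tf in index[term].items():
            dots[doc_id] += tfidf_weight(tf, df, n_docs) * w_q

    ranked = [
        (doc_id, dot / (norms[doc_id] * q_norm))
        for doc_id, dot in dots.items()
        if dot > 0.0 and norms.get(doc_id, 0.0) > 0.0
    ]
    # Ties are broken by document number so the output is deterministic.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy index with made-up term frequencies (not ClueWeb09 data).
toy_index = {
    "hong": {261: 3, 135: 1},
    "kong": {261: 2, 135: 1, 324: 1},
    "linux": {324: 4, 500: 1},
    "open": {500: 2},
}
toy_n_docs = 4
toy_norms = document_norms(toy_index, toy_n_docs)
results = rank("Hong Kong", toy_index, toy_n_docs, toy_norms)

for doc_id, score in results:
    # Same layout as the example output: <doc#> <similarity score>
    print(doc_id, round(score, 4))
```

Precomputing the document norms once keeps query-time work proportional to the postings of the query terms only. On a collection the size of ClueWeb09 the norms would more plausibly be stored alongside the index than rebuilt in memory.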