HW#2

IR Homework #2
By J. H. Wang
Mar. 31, 2015
Programming Exercise #2:
Query Processing and Searching
• Goal: to find the documents relevant to a
given query
• Input: a query (and the inverted index)
– (simple search: keyword, Boolean)
• Output: a ranked list of search results
from the ClueWeb09 collection
– (details to be described later)
Input: User Query and Inverted Index
• Simple queries
– Single keywords
• Ex: Microsoft, airplanes, …
– Free texts
• Ex: United States, non-profit organization, …
– Simple Boolean search
• Ex: open source AND Linux, software engineer OR
project manager, …
• Inverted Index
– As generated in HW#1
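The simple Boolean search above can be sketched as set operations on posting lists. This is only a minimal sketch under stated assumptions: the index layout (`term -> {doc_id: tf}`), the function names, and the choice to treat a multi-word operand such as "open source" as "all words must occur" are illustrative, not the required HW#1 format or the required semantics. It handles a single uppercase AND or OR per query.

```python
def postings(index, term):
    """Return the set of doc IDs containing `term` (empty set if unseen)."""
    return set(index.get(term.lower(), {}))


def match_operand(index, text):
    """Docs containing every word of a free-text operand (assumed semantics)."""
    words = text.split()
    if not words:
        return set()
    result = postings(index, words[0])
    for word in words[1:]:
        result &= postings(index, word)
    return result


def boolean_search(index, query):
    """Evaluate 'A AND B' or 'A OR B' (one operator), else a plain operand."""
    for op, combine in ((" AND ", set.intersection), (" OR ", set.union)):
        if op in query:
            left, right = query.split(op, 1)
            return combine(match_operand(index, left),
                           match_operand(index, right))
    return match_operand(index, query)


# Hypothetical toy index: term -> {doc_id: term frequency}
index = {
    "open": {1: 2, 2: 1},
    "source": {1: 1, 2: 3, 3: 1},
    "linux": {2: 4, 3: 2},
}

print(sorted(boolean_search(index, "open source AND Linux")))  # [2]
print(sorted(boolean_search(index, "open source OR Linux")))   # [1, 2, 3]
```

Representing posting lists as sets keeps the sketch short; for a corpus the size of ClueWeb09, sorted posting lists with a merge-style intersection would likely be more memory-friendly.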
Output: Ranked Search Results
• A ranked list of search results from the
ClueWeb09 collection
– Ranking: vector space model
• Term weighting scheme: TF-IDF
• Similarity estimation: cosine similarity between
query and document vectors
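\[ w_{i,j} = (1 + \log tf_{i,j}) \cdot \log(N / df_i) \]

\[ \vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) \]

\[ \vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q}) \]

\[ sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} \]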
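A minimal sketch of the ranking step, using the TF-IDF weight and cosine similarity defined above. Several things here are assumptions made for illustration: the index layout (`term -> {doc_id: tf}`), the function names, the use of the natural logarithm (the slides do not state a base), and weighting the query with the same scheme as the documents.

```python
import math
from collections import Counter


def tfidf_vector(term_counts, df, n_docs):
    """Map each term to (1 + log tf) * log(N / df); unknown terms are skipped."""
    return {
        t: (1 + math.log(tf)) * math.log(n_docs / df[t])
        for t, tf in term_counts.items()
        if tf > 0 and df.get(t, 0) > 0
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(index, n_docs, query):
    """Return [(doc_id, score)] sorted by descending score, ties by doc_id."""
    df = {t: len(plist) for t, plist in index.items()}
    # Invert the index back into per-document term counts.
    doc_counts = {}
    for term, plist in index.items():
        for doc_id, tf in plist.items():
            doc_counts.setdefault(doc_id, {})[term] = tf
    q_vec = tfidf_vector(Counter(query.lower().split()), df, n_docs)
    scores = []
    for doc_id, counts in doc_counts.items():
        score = cosine(tfidf_vector(counts, df, n_docs), q_vec)
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy index with N = 4 documents.
index = {
    "hong": {1: 3, 2: 1},
    "kong": {1: 2, 3: 1},
    "island": {2: 1, 3: 2, 4: 1},
}
results = rank(index, n_docs=4, query="Hong Kong")
for doc_id, score in results:
    print(doc_id, round(score, 2))
```

Rebuilding per-document vectors at query time is only reasonable for a toy index. On a real collection one would more likely precompute document norms and accumulate scores term by term over the posting lists.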
Example Output
• Ex:
– Query: “Hong Kong”
– Result: <doc#> <similarity score>
• 261 0.85
135 0.67
324 0.3
…
Optional Features
• Optional functionalities
– Better user interface for search
– Complex queries: phrase, wildcard, substring,
proximity search, combinations of Boolean
operators, … (Ch.2 & 3)
– Query processing: spell-correction, phonetic
correction, … (Ch.3)
– Different term weighting schemes: variants of TF-IDF, … (Ch.6)
– Inexact top-k retrieval: index elimination, champion
lists, impact-ordering, tiered index, … (Ch.7)
– Each optional feature should be switchable on/off via a parameter
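One way to make optional features switchable by a parameter is a command-line flag per feature. The sketch below uses Python's standard `argparse` module; the flag names (`--spell-correct`, `--champion-lists`, `--weighting`, `--top-k`) are hypothetical and not prescribed by the assignment.

```python
import argparse


def build_parser():
    """Command-line interface with one switch per optional feature."""
    parser = argparse.ArgumentParser(description="HW#2 search (sketch)")
    parser.add_argument("query", help="query string, e.g. 'Hong Kong'")
    # Each optional feature defaults to off and is enabled by its flag.
    parser.add_argument("--spell-correct", action="store_true",
                        help="enable spelling correction of query terms")
    parser.add_argument("--champion-lists", action="store_true",
                        help="enable inexact top-k retrieval via champion lists")
    parser.add_argument("--weighting", choices=["tfidf", "tf"], default="tfidf",
                        help="term weighting scheme")
    parser.add_argument("--top-k", type=int, default=10,
                        help="number of results to print")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(["Hong Kong", "--champion-lists"])
print(args.spell_correct, args.champion_lists, args.weighting, args.top_k)
# False True tfidf 10
```

Keeping every feature off by default means the minimum-requirement behaviour stays unchanged unless a flag is passed.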
Submission
• Your submission *should* include
– The source code (and your configurations of extra libraries)
• If you use open source tools, please also submit the source
code that calls their APIs or libraries
– A one-page documentation including
• Major features: ex: high efficiency, low storage, multiple input
formats, huge corpus, …
• Major difficulties encountered
• Special requirements for execution environments (ex: Java
Runtime Environment, special compilers, …)
• Team members list: the names and the responsible parts of each
individual member should be clearly identified
• Due: three weeks (Apr. 27, 2015)
Submission Instructions
• Programs and related electronic files in your
homework must be submitted directly on the
submission site:
– Submission site: http://140.124.183.13/
– Preparing your submission file: as one single
compressed file
• Name your file according to your ID such as <ID>_HW2.zip.
• Remember to specify the names and student IDs of your team
members in the files and documentation
– If you cannot successfully submit your work, please
contact the TA (@ R1424, Technology Building)
Evaluation
• Minimum requirement: correctness for simple
queries
– Some example queries from ClueWeb09 Test
Collection will be submitted to your program, and the
ranked list will be checked for effectiveness
• Optional features will be considered as bonus
– Various query types, weighting schemes, efficient
scoring and ranking, …
• You may be required to give a demo if the TA
is unable to run your submitted program
Any Questions or Comments?