IR_lab4_stemming_2015

Information retrieval
Lab 4: Improving the index
In this lab you will improve the index you built in the last lab by adding
support for stemming and for handling punctuation, numbers and dates.
Part 1: Stemming
Experiment on the ‘ordinary’ words from the Moby common words list using
three different stemmers, for example Lovins and Porter (both in the stemming
package at https://pypi.python.org/pypi/stemming/1.0) and UEAlite.
For each stemmer, build a list of original and stemmed words.
What’s the vocabulary size for each?
How many terms are being conflated? (estimate from a sample if necessary)
What sort of terms are they? How much difference might it make for retrieval?
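One way to approach these questions is to group the original words by their stem, so that the vocabulary size and the conflated terms come out of the same table. The sketch below assumes a stemmer is simply a function from a word to its stem. toy_stem is a hypothetical stand-in (a crude suffix stripper, not Lovins, Porter or UEAlite) used only so the example runs without extra packages, and conflation_report is an illustrative name, not part of any library.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper. A stand-in only, NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words such as "sing" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflation_report(words, stem):
    """Group words by stem and summarise how much the stemmer conflates."""
    classes = defaultdict(set)
    for w in words:
        classes[stem(w)].add(w)
    # A stem is "conflating" if more than one original word maps to it.
    conflated = {s: ws for s, ws in classes.items() if len(ws) > 1}
    return {
        "original_vocab": len(set(words)),
        "stemmed_vocab": len(classes),
        "conflated_terms": sum(len(ws) for ws in conflated.values()),
        "classes": conflated,
    }


sample = ["walk", "walks", "walked", "walking", "run", "runs", "sing"]
report = conflation_report(sample, toy_stem)
print(report["original_vocab"], report["stemmed_vocab"], report["conflated_terms"])
for s, ws in sorted(report["classes"].items()):
    print(s, sorted(ws))
```

To compare real stemmers, pass each stemmer's stem function in place of toy_stem and run the report over the Moby word list. Printing the conflated classes for a random sample of stems is a quick way to see what sort of terms are being merged.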
Part 2: Punctuation, numbers and dates
Working with the subset of the Reuters-21578 corpus on Blackboard, write Python
functions to:
1. Count and strip punctuation.
2. Identify numbers and numeric expressions.
3. Identify dates and times.
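As a starting point for the three tasks above, here is a minimal sketch using only the standard library. The function names and the regular expressions are assumptions made for illustration. In particular, the date pattern only covers the DD-MON-YYYY style (for example 26-FEB-1987) and the time pattern only covers HH:MM with optional seconds, so other formats in the corpus would need further patterns.

```python
import re
import string
from collections import Counter

# Assumed patterns, chosen for illustration; extend them after inspecting the data.
DATE_RE = re.compile(
    r"\b\d{1,2}-(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)-\d{4}\b",
    re.IGNORECASE,
)
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2}(?:\.\d+)?)?\b")
# Either a comma-grouped number (1,234,567) or a plain integer/decimal (12.5).
NUMBER_RE = re.compile(r"\b\d{1,3}(?:,\d{3})+(?:\.\d+)?\b|\b\d+(?:\.\d+)?\b")


def count_and_strip_punctuation(text):
    """Return (Counter of ASCII punctuation marks, text with them removed)."""
    counts = Counter(ch for ch in text if ch in string.punctuation)
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return counts, stripped


def find_dates(text):
    """Return all substrings matching the assumed DD-MON-YYYY pattern."""
    return DATE_RE.findall(text)


def find_times(text):
    """Return all substrings matching the assumed HH:MM[:SS[.ff]] pattern."""
    return TIME_RE.findall(text)


def find_numbers(text):
    """Return integers, decimals and comma-grouped numbers as strings."""
    return NUMBER_RE.findall(text)


counts, stripped = count_and_strip_punctuation("Hello, world! (test)")
print(sum(counts.values()), stripped)

print(find_dates("26-FEB-1987 15:01:01.79"))
print(find_times("26-FEB-1987 15:01:01.79"))
print(find_numbers("Profit rose 12.5 pct to 1,234,567 dlrs in 1987"))
```

Two design points are worth noting. Stripping punctuation first would break expressions such as 1,234,567 or 15:01, so it is safer to identify numbers, dates and times before stripping. Also, find_numbers will pick up the digit groups inside dates and times, so you may want to remove or mask the date and time matches before looking for numbers.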