Coursework: guidance notes and suggestions

CSM06 Information Retrieval
Notes for Coursework
Dr Andrew Salway (
NB. These notes are offered as suggestions only and do not constitute instructions for the
A good report will be one that demonstrates an in-depth understanding of one idea to do
with information retrieval. This does NOT mean that you are trying to make the best
possible information retrieval system, nor a complete system. You may conclude that the
idea you investigate is not such a good one, and still get good marks so long as you can
show why you think that is so. You should start with an idea that has been reported in the
information retrieval literature, though you may propose modifications to it based on the
work you carry out.
The rest of these notes are organized as follows:
1) Suggestions for ideas to review, implement and experiment with, and some starting
points for reading
2) Suggested Plan of Action
3) Notes About Literature Review
4) Notes About System Implementation
5) Notes About Experimentation and Data Sets
1) Suggestions for ideas to review, implement and experiment with and starting
points for reading
The aim is to evaluate an existing idea, and optionally to modify it. The following topics
are suggested, along with some literature to use as a starting point. These topics will all
be covered in lectures during the module. You may choose one of these, or one other
idea. You should confirm your idea with AJS by Tuesday 11th October at the latest.
a) Term Clusters for Query Expansion to solve the problem of synonymy. It is claimed
that query expansion can help to reduce the problem of synonymy. Techniques for
automatic query expansion have been proposed that are based on the idea of making term
clusters from global analysis, and from local analysis. READING: Xu and Croft (1996),
“Query Expansion Using Local and Global Document Analysis”, Available Online; see
also Baeza-Yates and Ribeiro-Neto (1999), Modern Information Retrieval, pp. 124-127,
book available in Library and copies of these pages also in Library Article Collection.
b) Measuring the ‘Quality’ of a Webpage with PageRank. The ways in which
webpages are connected to one another (or not) might provide evidence about whether
they are high quality, or not. This idea has been used as part of algorithms to rank results
in web search engines. READING: Brin and Page, “The Anatomy of a Large-Scale
Hypertextual Web Search Engine”, Available Online; see also Baldi, Frasconi and Smyth
(2003), Modeling the Internet and the Web, Section 5.4 (available from AJS).
c) Finding Webpages Similar to One that the User is Interested In. Instead of making a
query, users might like to say ‘find me more pages like this one’: a search engine then
needs a way to compare pages and determine which are similar. READING: Dean and
Henzinger, “Finding Related Pages in the World Wide Web”, Available Online.
d) Document Clustering for Visualising the Result Set. If a query produces a lot of
results it might help the user if the results are organised into clusters so the user can see
what kinds of topics are covered by the documents in the result set. READING: Baldi,
Frasconi and Smyth (2003), Modeling the Internet and the Web, Section 4.8; see also the
CLUTO system – Available Online.
e) Image Retrieval by Keywords. To date, web search engines that do image
retrieval have relied on indexing images with keywords extracted from the HTML of
webpages containing, or pointing to, the images. READING: Smith and Chang,
“Searching for Images and Videos on the World Wide Web”, Available Online; also, try
searching for images with Google, AltaVista, etc.
f) Turning Users’ Questions into Queries. Perhaps information retrieval systems would
be more ‘user-friendly’ if users could ask normal kinds of questions, instead of having to
type in just keywords. The TRITUS system was developed to learn how to transform a
question from a user into a query for a web search engine. READING: Agichtein,
Lawrence and Gravano (2001), “Learning Search Engine Specific Query Transformations
for Question Answering”, Procs. 10th International World Wide Web Conference,
WWW10. Available Online.
g) Categorising Queries according to geographic location. Some queries that people
make to search engines have an implicit geographical aspect, e.g. if I query ‘houses for
sale’ then I probably mean ‘in Guildford’, whereas other queries may be more global. If
a search engine can recognise this, then maybe it could provide the user with better
results. READING: Gravano, Hatzivassiloglou and Lichtenstein (2003), “Categorizing
Web Queries According to Geographical Locality”, Procs. ACM Conference on
Information and Knowledge Management CIKM 2003. Available Online.
h) Reducing the Problem of Synonymy with Latent Semantic Indexing. The technique
of Latent Semantic Indexing exploits the fact that words that tend to co-occur in
documents may be related semantically. LSI is based on the Vector Space Model, but it
transforms the original vector space by performing a dimensionality reduction – the
dimensions in the reduced space reflect the latent semantics of the document set.
READING: Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web,
Section 4.5; see also the website of Telcordia Technologies for various papers and demos.
NB. This topic has a relatively high mathematical content.
i) Combining Image and Text Features for Image Classification and Retrieval. Maybe
systems can learn to recognise more about the content of images by being trained on sets
of features from images and texts that co-occur on webpages. READING: Yanai (2003),
“Generic Image Classification Using Visual Knowledge on the Web”, Procs of ACM
Multimedia 2003, pp. 167-176. May be available online: otherwise see AJS. See also,
Barnard et al (2003), “Matching Words and Pictures”, Journal of Machine Learning
Research 3, pp. 1107-1135 – available online.
NB. This topic has a relatively high mathematical content.
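To make topic (b) above more concrete, the following is a minimal sketch of how a PageRank-style score could be computed for a small link graph by repeated iteration. It is offered as an illustration only, not as the method from the readings: it assumes a commonly used normalised variant in which the scores sum to 1, and it assumes that pages with no outgoing links share their score equally among all pages. The function name pagerank, the parameter names, and the default of 50 iterations are all hypothetical choices; the damping value 0.85 is the value Brin and Page report usually using.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return an approximate PageRank-style score for every page.

    links: dict mapping a page name to the list of pages it links to
           (a hypothetical input format chosen for this sketch).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the score spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of the score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score equally to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links shares with all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For example, calling pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}) gives page C the highest score, because three pages link to it, and page D the lowest, because nothing links to it. A sketch like this may be enough to start experimenting with small, hand-made sets of linked pages.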
2) Suggested Plan of Action
***NOW***: Week 2 of module
• Form group.
• Find out each person’s skills; swap email addresses; fix meeting times.
• Plan how you will decide which idea to investigate, e.g. each person could look
into some different options and then report back to the group
Week 3 of module
• Meet to discuss and decide on what idea to investigate: confirm this with AJS
• Work out who will do what, and agree some deadlines. Some tasks might be:
further literature review; evaluating existing software that could be used for
implementation; planning experiments and getting necessary data
• Do some quick web searches to identify any other useful literature or resources to
do with your chosen topic
Week 4 of module
• All group members should be familiar with essential literature by now
• Finalise system specification, e.g. what is the input and what is the output?
• Finalise system design based on idea from literature.
• Allocate further system development tasks to group members, e.g.
implementation / testing / experimentation
Weeks 5-8 of module: System Implementation, Testing and Experimentation.
You may not have time to implement all that you hoped to – that’s ok. However, it is
important that you are able to process some input data and get some output, so that you
can begin to experiment with the system. You need to think about some experiments you
can carry out in order to evaluate the idea you have implemented.
Week 8
• Meet to prepare for presentation. Make sure all group members know about and
have access to all results
• Group presentation: make a good presentation and get good feedback from the
class – this will help in writing your report.
Weeks 9-10: DEADLINE ***Monday 21st November***
• Concentrate on individual report writing.
3) Notes About Literature Review
• Each group should concentrate on reading only a small number of papers in detail,
possibly only one, and probably not more than three.
• It is likely that you will need to look briefly at more papers than this in order to
identify your essential papers.
• You should only read in detail beyond the essential papers if you find that you
need more information for your implementation / experimentation.
• When reading a paper try to extract the key idea that the authors are proposing.
4) Notes About System Implementation
• The reuse of source code and existing software is recommended so long as it is
properly attributed, and so long as you understand how it works. This
coursework is not intended to test your programming skills – so no unnecessary
coding please!
• You are not being asked to produce a ‘polished’ final system; rather, you are
required to implement something that can be used to carry out some experiments
to test the idea that you are investigating. Thus it might be that your
implementation actually comprises several different existing systems that you
pass data between (manually if necessary). For example, you might take some
output from System Quirk text analysis and put it into Microsoft Excel for further
• For text analysis, consider using:
  System Quirk: available in AP Labs, and available online for download
  VisualText: information available online
• For statistical / mathematical processing, consider using:
  Microsoft Excel: available in AP Labs
  Matlab: available in AP Labs
• Note, Google distributes an API that you can use to access its databases.
  See: Google homepage and the book ‘Google Hacks’
• For other freeware IR tools see:
  Search Tools:
  Web IR pages: ;
• For document clustering, consider using:
  CLUTO: available online
5) Notes About Experimentation and Data Sets
The point of doing some experiments with your system is to get some evidence that lets
you say something about how good (or not) the idea from the literature is, and about how
well you’ve been able to implement it, and perhaps modify it.
• You need to be very clear and focussed about what it is you are testing when you
plan your experiments: you should also plan how you will analyse the results of
your experiment, and think about what the results will tell you.
• You will need some data, the details of which will depend on what you are doing,
but may for example include: a set of documents, a set of queries (with relevant
documents given), etc. Note that there are a number of publicly available data
sets that might be appropriate for your needs. Consider the Cystic Fibrosis
collection (see Lecture 1 slides); web search engine query logs; data from the
Open Directory Project, etc.
• You may also need some human ‘judges’ to evaluate the output of your system;
think about the way Precision and Recall are used to evaluate information
retrieval systems.
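As a reminder of how Precision and Recall work, the following minimal sketch computes both for a single query. It assumes you have the list of documents your system returned and the set of documents that judges marked as relevant. The function name precision_recall and the document identifiers are hypothetical, and returning 0.0 when a set is empty is a choice made for this sketch rather than a standard convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    # Documents that were both returned and judged relevant.
    hits = retrieved & relevant
    # Precision: what fraction of the returned documents are relevant?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant documents were returned?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if a system returns d1, d2, d3 and d4, and the judges marked d2, d4 and d7 as relevant, then two of the four returned documents are relevant (precision 0.5) and two of the three relevant documents were found (recall of about 0.67).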