CSM06 Information Retrieval
Notes for Coursework
Dr Andrew Salway (a.salway@surrey.ac.uk)

NB. These notes are offered as suggestions only and do not constitute instructions for the coursework.

A good report will be one that demonstrates an in-depth understanding of one idea to do with information retrieval. This does NOT mean that you are trying to make the best possible information retrieval system, nor a complete system. You may conclude that the idea you investigate is not such a good one, and still get good marks so long as you can show why you think that is so. You should start with an idea that has been reported in the information retrieval literature, though you may propose modifications to it based on the work you carry out.

The rest of these notes are organized as follows:
1) Suggestions for ideas to review, implement and experiment with, and some starting points for reading
2) Suggested Plan of Action
3) Notes About Literature Review
4) Notes About System Implementation
5) Notes About Experimentation and Data Sets

1) Suggestions for ideas to review, implement and experiment with, and starting points for reading

The aim is to evaluate an existing idea, and optionally to modify it. The following topics are suggested, along with some literature to use as a starting point. These topics will all be covered in lectures during the module. You may choose one of these, or one other idea. You should confirm your idea with AJS by Tuesday 11th October at the latest.

a) Term Clusters for Query Expansion to solve the problem of synonymy. It is claimed that query expansion can help to reduce the problem of synonymy. Techniques for automatic query expansion have been proposed that are based on the idea of making term clusters from global analysis, and from local analysis.
READING: Xu and Croft (1996), “Query Expansion Using Local and Global Document Analysis”, Available Online; see also Baeza-Yates and Ribeiro-Neto (1999), Modern Information Retrieval, pp.
124-127, book available in Library and copies of these pages also in Library Article Collection.

b) Measuring the ‘Quality’ of a Webpage with PageRank. The ways in which webpages are connected to one another (or not) might provide evidence about whether they are high quality, or not. This idea has been used as part of algorithms to rank results in web search engines.
READING: Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Available Online; see also Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web, Section 5.4 (available from AJS).

c) Finding Webpages Similar to One that the User is Interested In. Instead of making a query, users might like to say ‘find me more pages like this one’: a search engine then needs a way to compare pages and determine which are similar.
READING: Dean and Henzinger, “Finding Related Pages in the World Wide Web”, Available Online.

d) Document Clustering for Visualising the Result Set. If a query produces a lot of results it might help the user if the results are organised into clusters, so that the user can see what kinds of topics are covered by the documents in the result set.
READING: Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web, Section 4.8; see also the CLUTO system – Available Online.

e) Image Retrieval by Keywords. To date, web search engines that do image retrieval have relied on indexing images with keywords extracted from the HTML on webpages containing, or pointing to, the images.
READING: Smith and Chang, “Searching for Images and Videos on the World Wide Web”, Available Online; also, try searching for images with Google, AltaVista, etc.

f) Turning Users’ Questions into Queries. Perhaps information retrieval systems would be more ‘user-friendly’ if users could ask normal kinds of questions, instead of having to type in just keywords. The TRITUS system was developed to learn how to transform a question from a user into a query for a web search engine.
READING: Agichtein, Lawrence and Gravano (2001), “Learning Search Engine Specific Query Transformations for Question Answering”, Procs. 10th International World Wide Web Conference, WWW10. Available Online.

g) Categorising Queries according to geographic location. Some queries that people make to search engines have an implicit geographical aspect, e.g. if I query ‘houses for sale’ then I probably mean ‘in Guildford’, whereas other queries may be more global. If a search engine can recognise this, then maybe it could provide the user with better results.
READING: Gravano, Hatzivassiloglou and Lichtenstein (2003), “Categorizing Web Queries According to Geographical Locality”, Procs. ACM Conference on Information and Knowledge Management, CIKM 2003. Available Online.

h) Reducing the Problem of Synonymy with Latent Semantic Indexing. The technique of Latent Semantic Indexing (LSI) exploits the fact that words that tend to co-occur in documents may be related semantically. LSI is based on the Vector Space Model, but it transforms the original vector space by performing a dimensionality reduction – the dimensions in the reduced space reflect the latent semantics of the document set.
READING: Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web, Section 4.5; see also the website of Telcordia Technologies for various papers and demos – http://lsi.research.telcordia.com
NB. This topic has a relatively high mathematical content.

i) Combining Image and Text Features for Image Classification and Retrieval. Maybe systems can learn to recognise more about the content of images by being trained on sets of features from images and texts that co-occur on webpages.
READING: Yanai (2003), “Generic Image Classification Using Visual Knowledge on the Web”, Procs. of ACM Multimedia 2003, pp. 167-176. May be available online: otherwise see AJS. See also Barnard et al. (2003), “Matching Words and Pictures”, Journal of Machine Learning Research 3, pp. 1107-1135 – available online.
NB.
This topic has a relatively high mathematical content.

2) Suggested Plan of Action

***NOW***: Week 2 of module
Form group. Find out each person’s skills; swap email addresses; fix meeting times.
Plan how you will decide which idea to investigate, e.g. each person could look into some different options and then report back to the group.

Week 3 of module
Meet to discuss and decide on what idea to investigate: confirm this with AJS.
Work out who will do what, and agree some deadlines. Some tasks might be: further literature review; evaluating existing software that could be used for implementation; planning experiments and getting necessary data.
Do some quick web searches to identify any other useful literature or resources to do with your chosen topic.

Week 4 of module
All group members should be familiar with essential literature by now.
Finalise system specification, e.g. what is the input and what is the output?
Finalise system design based on idea from literature.
Allocate further system development tasks to group members, e.g. implementation / testing / experimentation.

Weeks 5-8 of module: System Implementation, Testing and Experimentation
You may not have time to implement all that you hoped to – that’s ok. However, it is important that you are able to process some input data and get some output, so that you can begin to experiment with the system.
You need to think about some experiments you can carry out in order to evaluate the idea you have implemented.

Week 8
Meet to prepare for presentation. Make sure all group members know about and have access to all results.
Group presentation: make a good presentation and get good feedback from the class – this will help in writing your report.

Weeks 9-10: DEADLINE ***Monday 21st November***
Concentrate on individual report writing.

3) Notes About Literature Review

Each group should concentrate on reading only a small number of papers in detail, possibly only one, and probably not more than three.
It is likely that you will need to look briefly at more papers than this in order to identify your essential papers. You should only read in detail beyond the essential papers if you find that you need more information for your implementation / experimentation. When reading a paper, try to extract the key idea that the authors are proposing.

4) Notes About System Implementation

The reuse of source code and existing software is recommended so long as it is properly attributed, and so long as you understand how it is working. This coursework is not intended to test your programming skills – so no unnecessary coding please! You are not being asked to produce a ‘polished’ final system; rather, you are required to implement something that can be used to carry out some experiments to test the idea that you are investigating. Thus it might be that your implementation actually comprises several different existing systems that you pass data between (manually if necessary). For example, you might take some output from System Quirk text analysis and put it into Microsoft Excel for further processing.

For text analysis, consider using:
System Quirk: available in AP Labs, and available online for download
VisualText: information available online

For statistical / mathematical processing, consider using:
Microsoft Excel: available in AP Labs
Matlab: available in AP Labs

Note, Google distributes an API that you can use to access its databases. See: Google homepage and the book ‘Google Hacks’.

For other freeware IR tools see:
Search Tools: www.searchtools.com
Web IR pages: http://www.webir.org/ ; http://www.webir.org/resources.html

For document clustering, consider using:
CLUTO: available online

5) Notes About Experimentation and Data Sets

The point of doing some experiments with your system is to get some evidence that lets you say something about how good (or not) the idea from the literature is, and about how well you’ve been able to implement it, and perhaps modify it.
You need to be very clear and focussed about what it is you are testing when you plan your experiments: you should also plan how you will analyse the results of your experiment, and think about what the results will tell you.

You will need some data, the details of which will depend on what you are doing, but may for example include: a set of documents, a set of queries (with relevant documents given), etc. Note that there are a number of publicly available data sets that might be appropriate for your needs. Consider the Cystic Fibrosis collection (see Lecture 1 slides); web search engine query logs; data from the Open Directory Project, etc.

You may also need some human ‘judges’ to evaluate the output of your system; think about the way Precision and Recall are used to evaluate information retrieval systems.
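
As a reminder of how Precision and Recall are calculated for a single query, here is a minimal sketch in Python. It assumes that your system returns a list of document identifiers and that your judges supply the set of identifiers they consider relevant. The function name precision_recall and the identifiers d1, d2, etc. are hypothetical, chosen only for illustration; returning 0.0 when a set is empty is also just one possible convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system (hypothetical ids)
    relevant:  document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    # Documents that were both retrieved and judged relevant.
    hits = len(retrieved_set & relevant_set)
    # Assumption: report 0.0 rather than fail when a set is empty.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    system_output = ["d1", "d2", "d3", "d4"]   # what the system returned
    judged_relevant = ["d2", "d4", "d7"]       # what the judges marked relevant
    p, r = precision_recall(system_output, judged_relevant)
    print(f"precision={p:.2f} recall={r:.2f}")  # prints precision=0.50 recall=0.67
```

In this example two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were retrieved (recall 2/3). For an experiment with several queries you might compute these values per query and then average them.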