CS523 INFORMATION RETRIEVAL
COURSE INTRODUCTION
YÜCEL SAYGIN
SABANCI UNIVERSITY

Contact Info
• ysaygin@sabanciuniv.edu
• http://people.sabanciuniv.edu/~ysaygin
• Tel: 9576
• No specific office hours. You can drop by any time you like; email or call me first to make sure I am at the office.

Course Info
• Reference Book: Introduction to Information Retrieval
• Authors: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
• Publisher: Cambridge University Press, 2008

Course Info
Grading:
• Homework: 10%
• Project: 40%
• Paper presentation: 20%
• Term paper: 20%
• Attendance during paper presentations: 10%

Topics that will be covered
• Document Retrieval Techniques
• Information Retrieval on the Web
• Data Mining for Information Retrieval

Aim of the course
• Knowledge: to introduce information retrieval techniques
• Skills: paper reading and presentation; research and/or project work

A Rough Schedule
• October, November: lectures on various information retrieval techniques
• Remaining weeks: paper and research project presentations

What I will do
• Give the basics of information retrieval
• Supervise projects
• Give directions and advice on the projects
• Coordinate the presentations

What I expect you to do
• Understand the basic concepts of Information Retrieval
• Choose a specific area and two related papers on the same topic for presentation in class
• Attend the paper presentations: attendance is required, and you will lose 2% of your overall grade for each presentation you miss
• Write a term paper on the two papers presented
• Do a project and write a final report describing what you learned or achieved in the scope of the project

Sources
• TREC Conference http://trec.nist.gov/
• SIGIR Conference http://www.sigir.org/
• WWW Conference http://www2004.org/
• ACM TOIS Journal
• SIGMOD, VLDB, ICDE Conferences (database perspective)
• SIGKDD, ICDM Conferences (data mining perspective)

Tools
• SMART IR (Cornell Univ.) http://www.cs.cornell.edu/Info/Projects/NLP/
• Glimpse (Univ. of Arizona) http://webglimpse.net/
• Google
• Altavista
• Yahoo

Information Retrieval
Refers to the retrieval of any type of information, such as:
• Structured data (e.g. relational database)
• Text (we will focus on this)
• Video
• Image, sound
• DNA

Document Retrieval
[Figure: User Query, Static Document Collection, Ranked Result]
• The document collection is indexed in advance
• The user query is ad hoc
• Results are ranked with respect to their similarity to the user query

Document Routing
• User profiles are set in advance
• Incoming documents are directed to relevant users
• Useful for redirecting corporate emails to the relevant departments (sales, marketing, support, etc.)

Performance Metrics for IR
• Precision
• Recall
• In practice it is difficult to achieve both high precision and high recall at the same time
[Figure: the whole document space, containing the retrieved documents, the relevant documents, and their overlap (relevant and retrieved documents)]

First Reading for Tomorrow
• The Anatomy of a Large-Scale Hypertextual Web Search Engine (WWW Conference 1998), paper by Sergey Brin and Lawrence Page
• www-db.stanford.edu/~backrub/google.html

Web Information Retrieval
Two possible ways:
• Use the web structure, starting from a location like Yahoo where things are categorized
• Use search engines

Web Information Retrieval
Challenges
• Scale: hundreds of millions of queries per day
• The Web grows, so continuous crawling is needed
• Obstacles due to the OS and disk seek time
• Google handles large data sets by indexing and compression
• Search quality is important
• Completeness of the index is important
• But ranking is also of utmost importance due to the size of the Web

Web Information Retrieval
Ranking (of Google)
• The idea is to give importance to pages that have a lot of back links
• Similar to the notion of citations in academia
• A link graph of the web was formed and maintained (518 million links in 1998 for the prototype)

Web Mining
• (Focused) Crawling and Indexing
• Topic Directories
• Clustering and Classification
• Hyperlink Analysis
• Personalization (profiles, preferences)
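
The Document Retrieval slide says results are ranked by their similarity to the user query but does not name a similarity measure. As a minimal sketch, the code below assumes cosine similarity over raw term counts, which is one common choice and not necessarily the one used in the course. The function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(query, documents):
    """Return (doc_id, score) pairs sorted by descending similarity to the query."""
    q = Counter(query.lower().split())
    scored = [
        (doc_id, cosine_similarity(q, Counter(text.lower().split())))
        for doc_id, text in documents.items()
    ]
    # Highest score first; ties broken by document ID for stable output
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical static document collection
docs = {
    "d1": "information retrieval on the web",
    "d2": "data mining for information retrieval",
    "d3": "relational database systems",
}
for doc_id, score in rank_documents("web information retrieval", docs):
    print(doc_id, round(score, 3))
```

In a real system the collection would be indexed in advance, as the slide notes, rather than re-tokenized for every query as this sketch does.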
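
The Precision and Recall metrics named on the Performance Metrics for IR slide can be illustrated with a small sketch. It assumes the standard set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The function name and the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant and retrieved documents
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 5 relevant, 2 in common
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5", "d6", "d7"}
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Retrieving every document in the collection would push recall to 1.0 while precision collapses, which is one way to see the tension between the two metrics.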
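
The back-link idea on the Ranking slide can be sketched as a plain in-link count over a link graph. This shows only the citation-counting intuition stated on the slide; it is not the PageRank computation from the Brin and Page paper. The graph, page names, and function name are hypothetical.

```python
from collections import Counter

def rank_by_backlinks(links):
    """Rank pages by the number of distinct other pages linking to them.

    links: dict mapping a page to an iterable of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        counts[source] += 0  # make sure pages with no back links still appear
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    # Most back links first; ties broken by page name for stable output
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical link graph: page -> pages it links to
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(rank_by_backlinks(links))
# [('c', 3), ('a', 1), ('b', 1), ('d', 0)]
```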