CSCI 572: Information Retrieval and Search Engines: Summer 2010
Prof. Chris A. Mattmann

The Class
• Will give you a complete treatment of the area of search engines and information retrieval
  – The fundamental building blocks of the web and search engines
    • The Search Engine Architecture proposed by Brin/Page
    • Understanding algorithms for ranking pages
    • Understanding technologies for characterizing, downloading, parsing, indexing, searching and disseminating web content
  – Advanced topics in search engines such as BigData and distributed computation
• Will equip you with the necessary skills to design complex, real-world search engines

May-20-10 CS572-Summer2010 CAM-2

General class information
• Lecture, but…
  – You can participate
  – You should participate
  – You will participate, that is, if you want to do well :)
• Breakdown of points
  – 20% participation
  – 40% research paper presentation
  – 40% course project

General class information
• Syllabus/website: http://sunset.usc.edu/classes/cs572_2010/
  – Visit it often, as the schedule may change!
  – This is where all of your course project info and presentation info will be posted
  – This site will point you to required reading (research papers), and to lectures that you can download before class

What we’ll cover
• Theory
  – Understanding of basic information retrieval
  – Search engine querying
  – Search engine ranking
  – Architecture of search engines and technologies
  – Design Patterns
• Practice
  – Modern search engine technologies from Apache

Course Presentation
• Each week, we’ll read a few research papers on search engines
• For the first part of the course (5 weeks), I’ll lecture on the general topics that the research papers cover
  – The search engine architecture: fetching, parsing, indexing, querying, distributed computation, etc.
• For the last part of the course (~5 weeks), each one of you will present on one of the research papers we covered in the first 5 weeks

Course Presentation
• What I’m looking for (~20 minutes of presentation, with ~5 minutes of questions at the end)
  – You understood the paper
  – Discussion of related work and background
  – Discussion of why I should care about the topic
    • And more importantly why your fellow classmates should care
  – Relation of your paper to the lecture slides I gave on the topic
  – Simple summarization and description of the algorithm and/or technology introduced in the paper
  – What the results/contributions/conclusions of the paper were
  – Your evaluation of Pros of the paper
  – Your evaluation of Cons of the paper

Course Presentation
• What I’m NOT looking for
  – Plagiarism
  – Repetition
  – Cutting/Pasting out of the paper
  – Regurgitation
  – You to follow the EXACT set of bullets that I gave on the prior slide
• You should be looking to be innovative
  – Show the class and me that you really understood what was in the paper
  – Treat it like a conference presentation

Course Project
• You will get to leverage one or a combination of several Apache software technologies
  – Nutch, Tika, Lucene, Solr, Hadoop, HBase, Hive, Cassandra, etc.
• You will make a significant contribution to one or more of the above communities
• Deliverables
  – A 2-page project proposal
  – A 2-page mid-term project report
  – Source code and final demonstration to me at end of class

Course Project
• Deliverables
  – Your project proposal should include:
    • Demonstration that you’ve researched your particular idea with pointers to issue trackers and mailing lists
    • Objectives section
    • Approach section
    • Identification of deliverables section
    • Timeline/Schedule
  – Your mid-term report should include:
    • Current status
    • Blockers to completion
    • Planned mitigation to blockers

Me
• Graduated with my Ph.D. in Computer Science from USC in 2007
  – Advisor: Dr. Nenad Medvidovic
• Was a student at USC from 1998-2007
  – B.S., Computer Science 2001
  – M.S., Computer Science 2003
• My research interests
  – The intersection of software architectures and large-scale data dissemination
  – Software connector selection
  – Bayesian decision theory
  – Reinforcement learning
  – Search Engines

So…today
• Quick lecture on characterizing the web
• Read the papers linked from the syllabus
• Be ready for next Tuesday as this is a 10-week course and we are going to dive in