Course Introduction - Center for Software Engineering

CSCI 572: Information Retrieval and
Search Engines: Summer 2010
Prof. Chris A. Mattmann
The Class
• Will give you a complete treatment of the area of search engines and
information retrieval
– The fundamental building blocks of the web and search engines
• The Search Engine Architecture proposed by Brin/Page
• Understanding algorithms for ranking pages
• Understanding technologies for characterizing, downloading,
parsing, indexing, searching and disseminating web content
– Advanced topics in search engines such as BigData and distributed
computation
• Will equip you with the necessary skills to design complex, real-world search engines
May-20-10
CS572-Summer2010
CAM-2
General class information
• Lecture, but…
– You can participate
– You should participate
– You will participate, that is, if you want to do well :)
• Breakdown of points
– 20% participation
– 40% research paper presentation
– 40% course project
General class information
• Syllabus/website:
http://sunset.usc.edu/classes/cs572_2010/
– Visit it often, as the schedule may change!
– This is where all of your course project info and
presentation info will be posted
– This site will point you to required reading (research
papers), and to lectures that you can download before
class
What we’ll cover
• Theory
– Understanding of basic information retrieval
– Search engine querying
– Search engine ranking
– Architecture of search engines and technologies
– Design Patterns
• Practice
– Modern search engine technologies from Apache
Course Presentation
• Each week, we’ll read a few research papers on
search engines
• For the first part of the course (5 weeks), I’ll
lecture on the general topics that the research
papers cover
– The search engine architecture: fetching, parsing,
indexing, querying, distributed computation, etc.
• For the last part of the course (~5 weeks), each one
of you will present on one of the research papers
we covered in the first 5 weeks
Course Presentation
• What I’m looking for (~20 minutes of presentation, with ~5
minutes of questions at the end)
– You understood the paper
– Discussion of related work and background
– Discussion of why I should care about the topic
• And more importantly why your fellow classmates should care
– Relation of your paper to the lecture slides I gave on the topic
– Simple summarization and description of the algorithm and/or
technology introduced in the paper
– What were the results/contributions/conclusions of the paper
– Your evaluation of Pros of the paper
– Your evaluation of Cons of the paper
Course Presentation
• What I’m NOT looking for
– Plagiarism
– Repetition
– Cutting/Pasting out of the paper
– Regurgitation
– You to follow the EXACT set of bullets that I gave on the prior slide
• You should be looking to be innovative – show the class and
me that you really understood what was in the paper
– Treat it like a conference presentation
Course Project
• You will get to leverage one or a combination of several
Apache software technologies
– Nutch, Tika, Lucene, Solr, Hadoop, HBase, Hive, Cassandra, etc.
• You will make a significant contribution to one or more of
the above communities
• Deliverables
– A 2-page project proposal
– A 2-page mid-term project report
– Source code and final demonstration to me at end of class
Course Project
• Deliverables
– Your project proposal should include:
• Demonstration that you’ve researched your particular idea with
pointers to issue trackers and mailing lists
• Objectives section
• Approach section
• Identification of deliverables section
• Timeline/Schedule
– Your mid-term report should include:
• Current status
• Blockers to completion
• Planned mitigation to blockers
Me
• Graduated with my Ph.D. in Computer
Science from USC in 2007
– Advisor: Dr. Nenad Medvidovic
• Was a student at USC from 1998-2007
– B.S., Computer Science 2001
– M.S., Computer Science 2003
• My research interests
– The intersection of software
architectures and large-scale data
dissemination
– Software connector selection
– Bayesian decision theory
– Reinforcement learning
– Search Engines
So…today
• Quick lecture on characterizing the web
• Read the papers linked from the syllabus
• Be ready for next Tuesday as this is a 10-week
course and we are going to dive in