Information Retrieval and Data Mining

Paul-Alexandru Chirita, Ph.D.
© 2010 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
About Me

The Academic Part:


Education:

B.Sc., Ecole Polytechnique, CS Dept., Paris, France

Dipl.-Ing., Automatics & Computer Science Faculty, “Politehnica” Univ. Bucharest, Romania

Ph.D., Information Retrieval & Data Mining (Summa Cum Laude), Univ. of Hannover,
Germany
Technical Program Committee Member for all major conferences & journals in
Information Retrieval & Data Mining:

ACM SIGIR, WWW (W3C), ECIR, ACM TOIS, IEEE TKDE, IPM (Elsevier), etc.

2 books on C/C++ programming and algorithms

About 25 articles published at top international conferences:

http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/c/Chirita:Paul=Alexandru.html
About Me

The Industrial Part:



Internships:

Schlumberger Industries, Paris, France (Mobile Computing)

Yahoo! Europe, Barcelona, Spain (Data Mining)

Google / Federal University of Amazonas, Manaus, Brazil (Information Retrieval)
Jobs:

L3S Research Center, Hannover, Germany (academic & industrial research, Information
Retrieval)

Adobe Systems, Bucharest, Romania (Information Retrieval, Community Tools, and more
recently Advertising & Business Optimization)
Contact:

pchirita@adobe.com
About The TA: Traian Rebedea


Education:

B.Sc., “Politehnica” University, CS Dept., Bucharest, Romania

M.Sc., “Politehnica” University, CS Dept., Bucharest, Romania

Ph.D. Student, Natural Language Processing, Technology Enhanced Learning,
“Politehnica” University, CS Dept., Bucharest, Romania
Already 11 articles published at top international conferences:


http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/r/Rebedea:Traian.html
Jobs:

Teaching Assistant, “Politehnica” University, CS Dept., Bucharest, Romania

Organizer, “Stagii pe Bune”
About The Course


You should learn:

How to build & evaluate a Search Engine

Basic Data Mining & Machine Learning algorithms to support the creation of your
search engine
Grading:

Group Research: 20%

Project: 40% (or 25%); must get at least ½ of the points here

Exam: 40%; must get at least ½ of the points here

Course activity: 5%
About The Group Research

Groups of ~4 persons

Themes:

Advances in Foundational Information Retrieval (2 groups)

Personalized Information Retrieval (2 groups)

Display Advertising Retrieval (2 groups)

Media Advertising Retrieval (1 group)

Customer & Market Discovery (1 smaller group)

Product Discovery (1 smaller group)

Results will be presented in class (2 groups per course day)

https://docs.google.com/spreadsheet/ccc?key=0AuLz70WXga63dHVMS3dUQVJhX1VfTFRzdG5naURmdVE#gid=0
About The Project

Groups of 1-5 persons

Your theme proposals are encouraged

Sample (complex) themes:



Opinion Mining Search Engine (about products, companies, etc.)

People Search Engine

Web Personalization Engine (define N experiences for a page, target best
experience for each user)

Advertising Engine (place ads into a web page, possibly also media ads)
Sample easy themes which can be made complex:

Site Specific Search Engine (search a single site, rank results appropriately)

Specialized Search Engine: Hotels only, Jobs only, Housing ads only, etc.
Sample (easy) themes:

Read and summarize some of the most recent articles in IR & DM
Miscellaneous: Dissertation Topics

Contact Traian or me

Topics may be similar to the semester project, but advanced algorithms are
required here (e.g., for the web personalization engine, I expect to see at
least an implementation of multi-armed bandits).
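
To make the multi-armed bandit requirement concrete, here is a minimal sketch of one common variant, epsilon-greedy, applied to choosing among N page experiences. It is only an illustration under stated assumptions: the class name, the `simulate` helper, and the click rates are hypothetical, the reward is a simulated click (1.0 or 0.0), and user context is ignored, so it picks one best experience overall rather than one per user.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over a fixed set of page experiences (arms)."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # times each experience was shown
        self.values = [0.0] * n_arms    # running mean reward per experience
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm, reward):
        # Incremental mean: new = old + (reward - old) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(click_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate visitors whose clicks follow hypothetical per-experience rates."""
    bandit = EpsilonGreedyBandit(len(click_rates), epsilon, seed)
    env = random.Random(seed + 1)  # separate generator for the simulated users
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = 1.0 if env.random() < click_rates[arm] else 0.0
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    # Three hypothetical experiences with made-up click-through rates.
    b = simulate([0.02, 0.05, 0.10])
    print("shown:", b.counts)
    print("estimated click rates:", [round(v, 3) for v in b.values])
```

A dissertation-level engine would likely go further, for instance with UCB or Thompson sampling, or with a contextual bandit that conditions the choice on user features.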
Code of Honor

You are encouraged to reuse code and any existing libraries, provided that you
explicitly acknowledge them

You are NOT allowed to copy the entire project or large chunks of it

Submission delays result in a grading penalty

10% per week for research / project signup (deadline November 1st)

10% per day for project delivery (deadline January 10th)
What will you learn to build (1)
What will you learn to build (2)
What will you learn to build (3)
What will you learn to build (4)
Textbook

Christopher Manning, Prabhakar Raghavan, Hinrich Schuetze: Introduction
to Information Retrieval

Free PDF:


http://nlp.stanford.edu/IR-book/information-retrieval-book.html
Buy @ Amazon:

http://www.amazon.com/Introduction-Information-Retrieval-ChristopherManning/dp/0521865719
Other Useful Books

Ian H. Witten, Alistair Moffat, Timothy C. Bell:
Managing Gigabytes: Compressing and Indexing Documents and Images


http://www.amazon.com/Managing-Gigabytes-Compressing-MultimediaInformation/dp/1558605703/ref=sr_1_1?ie=UTF8&s=books&qid=1286739262&sr=1-1
Ricardo Baeza-Yates, Berthier Ribeiro-Neto:
Modern Information Retrieval

http://www.amazon.com/Modern-Information-Retrieval-Ricardo-BaezaYates/dp/020139829X/ref=sr_1_1?s=books&ie=UTF8&qid=1286739380&sr=1-1
Disclaimers

The vast majority of the content that follows is taken from Stanford’s CS276
course on Information Retrieval & Data Mining

http://www.stanford.edu/class/cs276/

Many thanks to Prabhakar Raghavan for allowing me to re-use this content

The content on these slides has a purely academic nature and has no
relation to Adobe Systems Inc. whatsoever