Information Retrieval and Data Mining
Paul-Alexandru Chirita, Ph.D.

© 2010 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

About Me
  The Academic Part:
    Education:
      B.Sc., Ecole Polytechnique, CS Dept., Paris, France
      Dipl.-Ing., Automatics & Computer Science Faculty, “Politehnica” Univ. Bucharest, Romania
      Ph.D., Information Retrieval & Data Mining (Summa Cum Laude), Univ. of Hannover, Germany
    Technical Program Committee Member for all major conferences & journals in Information Retrieval & Data Mining: ACM SIGIR, WWW (W3C), ECIR, ACM TOIS, IEEE TKDE, IPM (Elsevier), etc.
    2 books on C/C++ programming and algorithms
    About 25 articles published at world-wide top conferences: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/c/Chirita:Paul=Alexandru.html

About Me
  The Industrial Part:
    Internships:
      Schlumberger Industries, Paris, France (Mobile Computing)
      Yahoo! Europe, Barcelona, Spain (Data Mining)
      Google / Federal University of Amazonas, Manaus, Brazil (Information Retrieval)
    Jobs:
      L3S Research Center, Hannover, Germany (academic & industrial research, Information Retrieval)
      Adobe Systems, Bucharest, Romania (Information Retrieval, Community Tools, and more recently Advertising & Business Optimization)
    Contact: pchirita@adobe.com

About The TA: Traian Rebedea
  Education:
    B.Sc., “Politehnica” University, CS Dept., Bucharest, Romania
    M.Sc., “Politehnica” University, CS Dept., Bucharest, Romania
    Ph.D. Student, Natural Language Processing, Technology Enhanced Learning, “Politehnica” University, CS Dept., Bucharest, Romania
  Already 11 articles published at world-wide top conferences: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/r/Rebedea:Traian.html
  Jobs:
    Teaching Assistant, “Politehnica” University, CS Dept., Bucharest, Romania
    Organizer, “Stagii pe Bune”

About The Course
  You should learn:
    How to build & evaluate a Search Engine
    Basic Data Mining & Machine Learning algorithms to support the creation of your search engine
  Grading:
    Group Research: 20%
    Project: 40% (or 25%) - Must get at least ½ the points here
    Exam: 40% - Must get at least ½ the points here
    Course activity: 5%

About The Group Research
  Groups of ~4 persons
  Themes:
    Advances in Foundational Information Retrieval (2 groups)
    Personalized Information Retrieval (2 groups)
    Display Advertising Retrieval (2 groups)
    Media Advertising Retrieval (1 group)
    Customer & Market Discovery (1 smaller group)
    Product Discovery (1 smaller group)
  Results will be presented in class (2 groups per course day)
  https://docs.google.com/spreadsheet/ccc?key=0AuLz70WXga63dHVMS3dUQVJhX1VfTFRzdG5naURmdVE#gid=0

About The Project
  Groups of 1-5 persons
  Your theme proposals are encouraged
  Sample (complex) themes:
    Opinion Mining Search Engine (about products, companies, etc.)
    People Search Engine
    Web Personalization Engine (define N experiences for a page, target best experience for each user)
    Advertising Engine (place ads into a web page, possibly also media ads)
  Sample easy themes which can be made complex:
    Site Specific Search Engine (search a single site, rank results appropriately)
    Specialized Search Engine: Hotels only, Jobs only, Housing ads only, etc.
  Sample (easy) themes:
    Read and summarize some of the most recent articles in IR & DM

Miscellaneous: Dissertation Topics
  Contact Traian or me
  Topics may be similar to the semester project, but advanced algorithms are required here (e.g., for the web personalization engine, I expect to see at least an implementation of multi-armed bandits).

Code of Honor
  You are encouraged to re-use code / any existing libraries, provided that you explicitly mention it
  You are NOT allowed to copy the entire project or large chunks of it
  Submission delays result in a grading penalty:
    10% per week for research / project signup (deadline November 1st)
    10% per day for project delivery (deadline January 10th)

What will you learn to build (1) to (4)
  [Four image-only slides; the images are not available in this text]

Textbook
  Christopher Manning, Prabhakar Raghavan, Hinrich Schuetze: Introduction to Information Retrieval
    Free PDF: http://nlp.stanford.edu/IR-book/information-retrieval-book.html
    Buy @ Amazon: http://www.amazon.com/Introduction-Information-Retrieval-ChristopherManning/dp/0521865719

Other Useful Books
  Ian H. Witten, Alistair Moffat, Timothy C. Bell: Managing Gigabytes: Compressing and Indexing Documents and Images
    http://www.amazon.com/Managing-Gigabytes-Compressing-MultimediaInformation/dp/1558605703/ref=sr_1_1?ie=UTF8&s=books&qid=1286739262&sr=1-1
  Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval
    http://www.amazon.com/Modern-Information-Retrieval-Ricardo-BaezaYates/dp/020139829X/ref=sr_1_1?s=books&ie=UTF8&qid=1286739380&sr=1-1

Disclaimers
  The vast majority of the content that follows is taken from Stanford’s CS276 course on Information Retrieval & Data Mining: http://www.stanford.edu/class/cs276/
  Many thanks to Prabhakar Raghavan for allowing me to re-use this content
  The content on these slides is purely academic in nature and has no relation to Adobe Systems Inc. whatsoever