LIBR 558: Information Retrieval Systems: Structures and Algorithms

advertisement
LIBR 558: Information Retrieval Systems: Structures and Algorithms
Course Syllabus
Program:
Master of Library and Information Studies
Year:
Fall session 2013-2014, Term 1
Course Schedule: Wednesday, 2:00 p.m. – 4:50 p.m.
Location:
IBLC 158
Instructor:
Edie Rasmussen
Office location:
Room 478, IBLC
Office phone:
(604) 827-5486
Office hours:
Monday, 10:30 a.m. – 11:30 a.m.
Wednesday, 10:30 a.m. – 11:30 a.m.
Email address:
edie.rasmussen@ubc.ca
Course website:
See your Connect account.
Course Goal: To provide an introduction to the methods used in the storage and retrieval
of textual, pictorial, graphic, and voice data.
Course Objectives:





understand the complexity of information retrieval;
understand the functions of an information retrieval system;
be able to understand and measure the contribution of the components of an
information retrieval system to its performance;
be able to isolate the factors which optimize the information retrieval process;
be aware of current issues in information retrieval, including search engines.
Course Topics:








Documents and queries
Information retrieval models
Evaluating information retrieval systems
Implementing information retrieval systems
Improving effectiveness of information retrieval systems
Multimedia information retrieval systems
Information retrieval on the WWW
Users and information retrieval
1
Prerequisites:
MLIS and Dual MAS/MLIS: completion of the MLIS core
MAS: completion of MAS core and permission of the SLAIS Graduate Adviser ARST/LIBR
500, LIBR 501
Format of the course: Lectures, journal club, presentations, guest speakers, lab
sessions
Recommended and Required Readings:
Weakly readings will draw heavily on several recently published monographs:





R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval: the
Concepts and Technology Behind Search. 2nd ed. Harlow, England: Addison-Wesley.
S. Bϋttcher, C.L.A. Clarke, and G.V. Cormack (2010). Information Retrieval:
Implementing and Evaluating Search Engines. Cambridge, MA: MIT Press.
W.B. Croft, D. Metzler, T. Strohman, Search Engines: Information Retrieval in Practice.
Boston: Addison Wesley, 2010. (SEIRIP)
D. Manning, P. Raghavan, and H. Schϋtze, Introduction to Information Retrieval.
Cambridge : Cambridge University Press, 2008. (IIR)
Manning et al., and several chapters of Croft et al., and Bϋttcher et al., are available
on the publisher’ web site as noted in the readings. Copies are also available on
reserve.
Readings by Week:
Week
Readings
J. Allan, B. Croft, A. Moffat and M. Sanderson (2012). Frontiers,
challenges and opportunities for information retrieval. ACM SIGIR
Forum 46(1): 2-32 (June 2012). Available at
http://dl.acm.org/citation.cfm?id=2215678
N.J. Belkin (2008). Some(what) grand challenges for information
retrieval. Keynote lecture, European Conference on Information
Retrieval, Glasgow, Scotland, 31 March 2008. Available at
http://www.sigir.org/forum/2008J/2008j-sigirforum-belkin.pdf
1
Chapter 1, Search engines and information retrieval. In W.B. Croft,
D. Metzler, T. Strohman, Search Engines: Information Retrieval in
Practice. Boston: Addison Wesley, 2010. Pp. 1-12.
N. Fuhr (2012). Salton Award Lecture: Information Retrieval as
Engineering Science. ACM SIGIR Forum 46(2): 19-28. Available at
http://dl.acm.org/citation.cfm?id=2422259
Griffiths, J.-M. and King, D.W. (2002). US information retrieval
system evolution and evaluation (1945-1975). IEEE Annals of the
History of Computing, 24(30: 35-55. [Available at
2
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=1024761
2
Chapter. 2: The term vocabulary and postings lists. In. C.D.
Manning, P. Raghavan, and H. Schϋtze, Introduction to Information
Retrieval. Cambridge : Cambridge University Press, 2008. Pp. 1844. [Available at
http://nlp.stanford.edu/IR-book/pdf/02voc.pdf
Chapter 4, Processing Text. In W.B. Croft, D. Metzler, T. Strohman,
Search Engines: Information Retrieval in Practice. Boston: Addison
Wesley, 2010. Available at:
http://www.pearsonhighered.com/croft1epreview/pdf/chap4.pdf
Chapter 1: Boolean retrieval. In. C.D. Manning, P. Raghavan, and H.
Schϋtze, Introduction to Information Retrieval. Cambridge :
Cambridge University Press, 2008. Pp. 1-17. [Available at
http://nlp.stanford.edu/IR-book/pdf/01bool.pdf]
3
Chapter 2, Architecture of a search engine. In W.B. Croft, D.
Metzler, T. Strohman, Search Engines: Information Retrieval in
Practice. Boston: Addison Wesley, 2010. Pp. 13-29.
Chapter 7, Retrieval models. In W.B. Croft, D. Metzler, T.
Strohman, Search Engines: Information Retrieval in Practice.
Boston: Addison Wesley, 2010. Pp. 233-296.
F. Crestani, M. Lalmas, C.J. Van Rijsbergen, and I. Campbell (1998).
“Is This Document Relevant? . . . Probably”: A Survey of Probabilistic
Models in Information Retrieval. ACM Computing Surveys 30(4):
528-552. Available at http://tinediey.cc/i7el1
4
Lemur Project Tutorials: Starting Out: Overview: Language Models
and Information Retrieval. Available at http://tinyurl.com/2a5gqte
X. Liu and W.B. Croft (2005). Statistical language modeling for
information retrieval. Annual Review of Information Science and
Technology 39(1): 1-31.
5
P. Borlund (2003). The concept of relevance in IR. Journal of the
American Society for Information Science and Technology, 54(10):
913-925.
Chapter 8, Evaluating search engines. In W.B. Croft, D. Metzler, T.
Strohman, Search Engines: Information Retrieval in Practice.
3
Boston: Addison Wesley, 2010. Available at
http://www.pearsonhighered.com/croft1epreview/pdf/chap8.pdf
Chapter 8, Evaluation in information retrieval. In C.D. Manning, P.
Raghavan and H. Schϋtze, Introduction to Information Retrieval.
Cambridge : Cambridge University Press, 2008. Available at
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
D. Harman, D. (2011). Information Retrieval Evaluation. Synthesis
Lectures on Information Concepts, Retrieval, and Services #19.
Morgan & Claypool Publishers. (107 p.) (Browse)
S. Robertson (2008). On the history of evaluation in IR. Journal of
Information Science 34(4): 439-2008. Available at
http://jis.sagepub.com/cgi/content/abstract/34/4/439
M. Sanderson (2010). Test collection based evaluation of information
retrieval systems. Foundations and Trends in Information Retrieval
4(4): 247-375. (Browse)
E. Voorhees (2007). TREC: Continuing information retrieval’s
tradition of experimentation. Communications of the ACM 50(11):
51-54. [Available at
http://portal.acm.org/citation.cfm?doid=1297797.1297822]
Chapter 5, Relevance feedback and query expansion. In R. BaezaYates and B Ribeiro-Neto, Modern Information Retrieval. Harlow,
England: Addison-Wesley, 2011. Pp. 177-202.
6
Chapter. 9, Relevance feedback and query expansion. In. C.D.
Manning, P. Raghavan and H. Schϋtze, Introduction to Information
Retrieval. Cambridge : Cambridge University Press, 2008. Pp. 162177. [Available at
http://nlp.stanford.edu/IR-book/pdf/09expand.pdf]
G. Salton & C. Buckley (1990). Improving retrieval performance by
relevance feedback. Journal of the American Society for Information
Science 41: 288-297.
Chapter 2, User Interfaces for Search (M. Hearst). In R. BaezaYates and B Ribeiro-Neto, Modern Information Retrieval. Harlow,
England: Addison-Wesley, 2011. Pp. 21-55.
7
Chapter 5: The User-oriented IR research approach. In Ingwersen,
P. (1992). Information Retrieval Interaction. London, TaylorGraham. Full text of book is available at
http://vip.db.dk/pi/iri/index.htm.
B.J. Jansen (2009). Understanding User-Web Interactions via Web
4
Analytics. Synthesis Lectures on Information Concepts, Retrieval,
and Services #6. Morgan & Claypool
D. Kelly (2009) Methods for Evaluating Interactive Information
Retrieval Systems with Users, Foundations and Trends in Information
Retrieval: Vol. 3: No 1—2, pp 1-224. (browse)
Chapter 19, Web search basics; Chapter 20, Web crawling and
indexes. In C.D. Manning, P. Raghavan and H. Schϋtze, Introduction
to Information Retrieval. Cambridge : Cambridge University Press,
2008 Available at
http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
8
M. Franceschet (2011). PageRank: Standing on the shoulders of
giants. Communications of the ACM 54(6): 92-101.
http://cacm.acm.org/magazines/2011/6/108660-pagerank-standingon-the-shoulders-of-giants/fulltext
P. Vogl and M. Barrett (2010). Regulating the information
gatekeepers. Communications of the ACM 53(11): 67-72. Available
at http://cacm.acm.org/magazines/2010/11/100623-regulating-theinformation-gatekeepers/fulltext
I.H. Witten (2008). Searching… in a Web. Journal of Universal
Computer Science 14(10): 1739-1762. Available at
http://www.cs.waikato.ac.nz/~ihw/papers/08-IHWSearchinginWeb.pdf
Ch. 14, Multimedia Information Retrieval (D. Ponceléon and M.
Slaney). In R. Baeza-Yates and B Ribeiro-Neto, Modern Information
Retrieval. Harlow, England: Addison-Wesley, 2011. Pp. 587-639.
R. Datta, D. Josh, J. Li and J.Z. Wang. (2008). Image retrieval:
ideas, influences and trends of the new age. ACM Computing
Surveys 40(2): 5.1-5.60.
9
P. Enser (2008). The evolution of visual information retrieval.
Journal of Information Science 34: 531-546. Available at
http://jis.sagepub.com/cgi/content/abstract/34/4/531
M.S. Lew, N. Sebe, C. Djeraba, and R. Jain (2006). Content-based
multimedia information retrieval: state of the art and challenges.
ACM Transactions on Multimedia Computing, Communications, and
Applications 2(1): 1-19. Available at
http://www.liacs.nl/~mlew/mir.survey16b.pdf
5
S. Rϋger (2010). Multimedia Information Retrieval. Synthesis
Lectures on Information Concepts, Retrieval, and Services #10.
Morgan & Claypool Publishers. (157 pp.). (Chapters 1 and 2)
10
Mid-term Examination (Take-Home)
Chapter 10, Social search. In W.B. Croft, D. Metzler, T. Strohman,
Search Engines: Information Retrieval in Practice. Boston: Addison
Wesley, 2010. Pp. 397-450.
D. Das and A.F.T. Martins (2007). A survey on automatic text
summarization. Language Technologies Institute, Carnegie Mellon
University. Available at www.cs.cmu.edu/~nasmith/LS2/dasmartins.07.pdf
11
W. Fan, L. Wallace, S. Rich, and Z. Zhang. (2006). Tapping the
power of text mining. Communications of the ACM 49(9): 76-82.
Available at http://dl.acm.org/citation.cfm?id=1151032
K.L. Kroeker (2011). Weighing Watson’s impact. Communications
of the ACM 54(7): 13-15. Available at
http://cacm.acm.org/magazines/2011/7/109887-weighing-watsonsimpact/fulltext
Roussinov, D., Weiguo, F. and Robles-Flores, J. (2008). Beyond
keywords: automated question answering on the Web.
Communications of the ACM 51(9): 60-65. [Available at
http://portal.acm.org/citation.cfm?id=1378743]
12
J. Zobel, and A. Moffat (2006). Inverted files for text search
engines. ACM Computing Surveys 38(2): 1-56.
13
Wrap-up
Recommended
A few more basic textbooks---older, but may be useful for explanations of some topics:


G.G. Chowdhury (1999). Introduction to Modern Information Retrieval. London:
Library Association.
W.B. Frakes and R. Baeza-Yates (eds.) (1992). Information Retrieval: Data Structures
& Algorithms. Englewood Cliffs, NJ: Prentice-Hall.
6


R.R. Korfhage (1997). Information Storage and Retrieval. New York: John Wiley.
I.H. Witten, A. Moffat, and T.C. Bell (1999). Managing Gigabytes: Compressing and
Indexing Documents and Images. 2nd ed. San Francisco, CA: Morgan Kaufmann.
And, two series which have excellent chapter (or longer) length discussions of specific
topics within information retrieval:


Foundations and Trends in Information Retrieval (D. Oard and M. Sanderson, Editorsin-Chief). NOW Publishing.
http://www.nowpublishers.com/journals/Foundations%20and%20Trends%C2%AE%20
in%20Information%20Retrieval/4#volumes
Synthesis Lectures on Information Concepts, Retrieval and Services. (G. Marchionini,
Editor) Morgan-Claypool http://www.morganclaypool.com/toc/icr/1/1
Course Assignments and Weight in relation to final course mark:
Assignment
A short paper providing the context for a journal club
discussion; leading the discussion
Midterm Exam (Take-Home)
Project (Group optional) Implement an information
retrieval software package and evaluate it using a
standard text collection and metrics; present results.
Participation
Weight
20%
30%
40%
10%
Course Schedule:
Date
Topic
Week 1
September 4
Introduction to course
Introduction to information
retrieval
 Goals of IR
 History of IR
 Recent developments
Allan (2012)
Belkin (2008)
Fuhr (2012)
Griffiths (2002)
SEIRIP, Ch. 1
Week 2
September 11
Representing document content
 Document processing
 Statistics of text
 Text processing (tokenizing,
stopwords, stemming)
 Index term weighting
 Weights and metrics
IIR, Ch. 2
SEIRIP, Ch. 4
Week 3
TPD
Information retrieval models I:
Readings/
Assignment
IIR, Ch 1
SEIRIP, Ch. 2,7
7
Classical models
 Boolean models
 Ranked output models
Week 4
TBD
Information retrieval models II:
Newer models
 Language models
 Latent semantic indexing
 Learning to rank
Crestani (1998)
Lemur
Liu (2005)
Week 5
October 2
Measuring effectiveness of IR
systems
 Laboratory models
 Evaluation measures
 Role of TREC
 Interactive evaluation
 User analytics
Borlund (2003)
CIIR, Ch. 8
SEIRIP, Ch. 8
Harman (2011)
Robertson (2008)
Sanderson (2010)
Voorhees (2007)
Week 6
October 9
Improving effectiveness of IR
systems
 Query expansion
 Relevance feedback
 Augmented search (e.g.
Wikification)
 Wisdom of crowds
 Recommender systems
 Tagging and folksonomies
IIR, Ch. 9
MIR, Ch. 5
Salton (1990)
Week 7
October 16
Week 8
October 23
Week 9
October 30
Users and information systems
 Interactive IR
 Studying users
 User interfaces
 Information visualization
IIIR, Ch. 5
Ingwersen (1992)
Jansen (2009)
Kelly (2009)
MIR, Ch. 2
IR applications I: Web search
engines
 Web IR
 Crawlers and indexers
 Search engines
 Webometrics
 Adversarial Search
 Computational Advertising
IIIR, Chs. 19, 20
Franceschet (2011)
Vogl (2010)
Witten (2008)
IR Applications II: Multimedia IR
Datta (2008)
8




Multimedia IR
Image retrieval
Video retrieval
Audio and music retrieval
Enser (2008)
Lew (2006)
MIR, Ch 14
Rϋger (2010)
Week 10
November 6
Mid-term Examination (Take
Home)
Week 11
November 13
IR Applications III: Working with
Text
 Text mining
 Text summarization
 Text categorization
 Question-answering systems
Das (2007)
Fan (2006)
Kroeker (2011)
Roussinov, 2008
SEIRIP, Ch 10
Week 12
November 20
Implementing IR systems
 History
 Technical issues
 Reliability
 Scalability
Zobel (2006)
Week 13
November 27
Course summary
Project/paper presentations
Final
papers/projects
Course Policies:
Attendance: The calendar states: “Regular attendance is expected of students in all their
classes (including lectures, laboratories, tutorials, seminars, etc.). Students who neglect
their academic work and assignments may be excluded from the final examinations.
Students who are unavoidably absent because of illness or disability should report to their
instructors on return to classes.”
Regular on-time attendance in class is an important and required part of this course. It is
your responsibility to obtain from one of the other class members any handouts
distributed and notes taken during sessions you miss.
Sudden unexpected problems arise for everyone (including myself), but I expect you to
attend and be on time for class. Absences or repeated tardiness will result in a lower
course mark or in a request from me that you drop the course. The size of an attendancerelated course mark penalty will be determined by the instructor. If you ARE late for class
(for whatever reason) please come into the classroom rather than waiting for the break.
9
Evaluation: All assignments will be marked using the evaluative criteria given on the
SLAIS web site.
The grades are based on submission of the assignment in accordance with the due date.
Decisions on extensions will be made on a case-by-case basis and extensions may result
in a grading penalty at the discretion of the instructor. Please note that within these
guidelines, a B+ mark is given for “Work demonstrating diligence and effort above basic
requirements.”
Written & Spoken English Requirement: Written and spoken work may receive a
lower mark if it is, in the opinion of the instructor, deficient in English.
Access & Diversity: Access & Diversity works with the University to create an inclusive
living and learning environment in which all students can thrive. The University
accommodates students with disabilities who have registered with the Access and
Diversity unit: [http://www.students.ubc.ca/access/drc.cfm]. You must register with the
Disability Resource Centre to be granted special accommodations for any on-going
conditions.
Religious Accommodation: The University accommodates students whose religious
obligations conflict with attendance, submitting assignments, or completing scheduled
tests and examinations. Please let your instructor know in advance, preferably in the first
week of class, if you will require any accommodation on these grounds. Students who plan
to be absent for varsity athletics, family obligations, or other similar commitments, cannot
assume they will be accommodated, and should discuss their commitments with the
instructor before the course drop date. UBC policy on Religious Holidays:
http://www.universitycounsel.ubc.ca/policies/policy65.pdf .
Academic Integrity
Plagiarism
The Faculty of Arts considers plagiarism to be the most serious academic offence that a
student can commit. Regardless of whether or not it was committed intentionally,
plagiarism has serious academic consequences and can result in expulsion from the
university. Plagiarism involves the improper use of somebody else's words or ideas in
one's work.
It is your responsibility to make sure you fully understand what plagiarism is. Many
students who think they understand plagiarism do in fact commit what UBC calls "reckless
plagiarism." Below is an excerpt on reckless plagiarism from UBC Faculty of Arts' leaflet,
"Plagiarism Avoided: Taking Responsibility for Your Work," (http://www.arts.ubc.ca/artsstudents/plagiarism-avoided.html).
"The bulk of plagiarism falls into this category. Reckless plagiarism is often the result of
careless research, poor time management, and a lack of confidence in your own ability to
think critically. Examples of reckless plagiarism include:

Taking phrases, sentences, paragraphs, or statistical findings from a variety of sources
and piecing them together into an essay (piecemeal plagiarism);
10

Taking the words of another author and failing to note clearly that they are not your
own. In other words, you have not put a direct quotation within quotation marks;

Using statistical findings without acknowledging your source;

Taking another author's idea, without your own critical analysis, and failing to
acknowledge that this idea is not yours;

Paraphrasing (i.e. rewording or rearranging words so that your work resembles, but
does not copy, the original) without acknowledging your source;

Using footnotes or material quoted in other sources as if they were the results of your
own research; and

Submitting a piece of work with inaccurate text references, sloppy footnotes, or
incomplete source (bibliographic) information."
Bear in mind that this is only one example of the different forms of plagiarism. Before
preparing for their written assignments, students are strongly encouraged to familiarize
themselves with the following source on plagiarism: the Academic Integrity Resource
Centre http://help.library.ubc.ca/researching/academic-integrity. Additional information
is available on the SAIS Student Portal http://connect.ubc.ca.
If after reading these materials you still are unsure about how to properly use sources in
your work, please ask me for clarification.
Students are held responsible for knowing and following all University regulations
regarding academic dishonesty. If a student does not know how to properly cite a source
or what constitutes proper use of a source it is the student's personal responsibility to
obtain the needed information and to apply it within University guidelines and policies. If
evidence of academic dishonesty is found in a course assignment, previously submitted
work in this course may be reviewed for possible academic dishonesty and grades
modified as appropriate. UBC policy requires that all suspected cases of academic
dishonesty must be forwarded to the Dean for possible action.
Additional course information:
Use of Connect:
Connect will be used for discussion of the readings and to communicate any special
announcements, clarifications on assignments, etc.
If you have general comments or queries, use the discussion list; if you have individual
concerns, please email me at edie.rasmussen@ubc.ca, call or visit during office hours.
Note that if your query is of general concern, I may address it on the course list.
Use of Dedicated Computer:
The computer closest to the door in the Kitimat Lab is available for class projects related
to IR evaluation; it has the appropriate software and datasets for our use. Note that the
11
software used is Open Source and you can load it on your own machines, but our license
for the data does not allow me to distribute it freely.
12
Download