INFORMATION STORAGE AND RETRIEVAL SYLLABUS

University of Wisconsin Milwaukee
School of Information Studies
Information Storage and Retrieval
783
INSTRUCTOR: Dr. Jin Zhang, Associate Professor
SCHOOL: School of Information Studies, UWM
CLASSROOM: Bolton 521
TIME: Thursday, 2:00 pm to 4:40 pm
SEMESTER: Fall 2010
OFFICE: Bolton 532
OFFICE HOURS: Wed, 1:00 pm to 2:00 pm
PHONE: 229-2712
EMAIL: jzhang@uwm.edu
1 Course Materials
Textbook:
[1] Information Storage and Retrieval by R. R. Korfhage, published by John
Wiley & Sons in 1997. ISBN 0-471-14338-3.
[2] (Optional) Introduction to Information Retrieval by Manning, C.D.,
Raghavan, P. and Schütze, H., 2008, Cambridge University Press, ISBN-13:
9780521865715.
[3] (Optional) Information Storage and Retrieval Systems Theory and
Implementation, Second Edition. Gerald J. Kowalski, Mark T. Maybury September 2000,
Kluwer Academic Publisher, ISBN 0-7923-7924-1.
Other books:
[1] Text Information Retrieval Systems by Charles T. Meadow, published by
Academic Press, Inc. in 2007, ISBN 0-12-369412-4. (Third Version)
[2] INFORMATION RETRIEVAL by C. J. van RIJSBERGEN B.Sc., Dip. NAAC,
Ph.D., M.B.C.S., F.I.E.E., C. Eng., F.R.S.E. The book is available at
http://www.dcs.gla.ac.uk/Keith/Preface.html
[3] (Optional) Visualization for Information Retrieval by Zhang, J. 2008,
Springer, ISBN: 978-3-540-75147-2
2 Course Description
This course on information storage and retrieval focuses on the theory and
concepts of information retrieval systems. It introduces the basic principles of
information storage, processing, and retrieval from the standpoint of information
retrieval system analysis and design.
Knowledge of, experience with, and a background in information systems are preferred.
Pre-requisites: L&I Sci 571; or consent of instructor.
3 Course Credit
Graduate, 3 credits
4 Course Objectives
Generally speaking, information retrieval operates at two levels. The first, external
level concerns the effective use of information in an existing information retrieval
system. The second, internal level concerns how information is processed within an
information retrieval system. This course focuses on the second level. Topics include
query structure and its characteristics, the representation of documents and other
objects within an information system, internal matching mechanisms, document analysis,
the user's perspective, reference points, retrieval effectiveness measures, alternative
retrieval techniques, output presentation, data file structures, information
visualization, Internet search engines, and a discussion of current research trends in
the field. The aim of this course is to prepare students to work as information
retrieval system analysts and designers.
The objectives are:
To outline basic terminology and components in information storage and retrieval
systems
To compare and contrast information retrieval models and internal mechanisms such
as Boolean, Probability, and Vector Space Models
To outline the structure of queries and documents
To articulate fundamental functions used in information retrieval such as automatic
indexing, abstracting, and clustering
To critically evaluate information retrieval system effectiveness and improvement
techniques
To understand the unique features of Internet-based information retrieval
To describe current trends in information retrieval such as information visualization.
5 Course Grading
Score      Grade   Description
96-100     A       superior work
91-95      A-
88-90      B+
84-86      B       satisfactory, but undistinguished work
80-83      B-
77-79      C+
74-76      C       work is below standard
70-73      C-
67-69      D+
64-66      D       unsatisfactory work
60-63      D-
below 60   F
Assignments include, but are not limited to, automatic indexing; presentations such as
a relevance measure review, a comparison of different types of information retrieval
systems, and the application of expert systems in information retrieval; and
information system use.
A significant portion of your grade is determined by your individual assignments. It
is extremely important that you understand the grading policies and earn high marks
on your assignments.
Assignment                   Grade
Weekly Assignments (9)       45% (5% each)
Participation & Discussion   10%
Project                      45%
6 Attendance & Class Participation:
Attendance is mandatory and class participation is expected. You will be graded on your
participation and contributions to class discussions.
7 Course Schedule
Week 1 Introduction
Content
What is information retrieval, Significance of information retrieval and storage,
Definition of information retrieval system, Objectives of information retrieval system,
Function overview, Relationships between Digital library and IRS, Abstraction,
Algorithm, Data structure, Measure of information systems, Logical organization,
Physical organization, Components of information retrieval systems, Comparisons among
different information systems, Research topics in IR.
Reading:
Chapter 1 Introduction to Information Retrieval by Manning
Lecture notes
N. J. Belkin and W. B. Croft. Information filtering and information retrieval: Two sides of the
same coin? Communications of the ACM, 35(12):29–38, 1992.
Week 2 Data control and data presentation
Content
Query, differences between documents and queries, types of documents, types of
data structure, document surrogates, vocabulary control, structure of a thesaurus,
structural representation, fine data structure, bit and byte, MARC structure.
Reading:
Chapter 2. Information Storage and Retrieval by Korfhage
Chapter 2 Introduction to Information Retrieval by Manning
Lecture notes
Week 3 Boolean system/model
Content
Sequential file, structure of a sequential file, inverted file, structure of an index
file, matching criteria, Boolean logic, limitations of Boolean logic, processing query
expression: reverse Polish notation, rules for operations.
Reading:
Chapter 2. Information Storage and Retrieval Systems by Korfhage
Lecture notes
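As an informal illustration of the Week 3 topics (not taken from the course materials), the sketch below builds a small inverted file and evaluates a Boolean query written in reverse Polish notation. The sample documents, the function names build_inverted_file and evaluate_rpn, and the whitespace tokenization are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def evaluate_rpn(tokens, index, all_ids):
    """Evaluate a Boolean query given in reverse Polish notation.

    Operators are the upper-case tokens AND, OR, and NOT; every other
    token is treated as a search term.
    """
    stack = []
    for tok in tokens:
        if tok == "AND":
            right, left = stack.pop(), stack.pop()
            stack.append(left & right)
        elif tok == "OR":
            right, left = stack.pop(), stack.pop()
            stack.append(left | right)
        elif tok == "NOT":
            stack.append(all_ids - stack.pop())
        else:
            # Look the term up in the inverted file (empty set if absent).
            stack.append(set(index.get(tok.lower(), set())))
    if len(stack) != 1:
        raise ValueError("malformed query")
    return stack.pop()


# Hypothetical three-document collection.
docs = {
    1: "boolean retrieval with inverted file",
    2: "vector retrieval model",
    3: "boolean logic and vector model",
}
index = build_inverted_file(docs)

# Infix query: (boolean AND vector) OR file
result = evaluate_rpn(["boolean", "vector", "AND", "file", "OR"], index, set(docs))
print(sorted(result))  # [1, 3]
```

The postfix form needs no parentheses or precedence rules, which is why a single stack is enough to process the query.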
Week 4 Vector retrieval system/model
Content
Vector model, document-term matrix, methods for designing weights to terms,
query in the vector model, spatial representation of a document in vector model,
Similarity between a query and a document (approach I), similarity between a query and
a document (approach II), some considerations for the vector model.
Reading:
Chapters 3 and 4. Information Storage and Retrieval by Korfhage
Chapter 14 Introduction to Information Retrieval by Manning
Lecture notes
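A minimal sketch of query-document matching in the vector model follows. The syllabus does not say what "approach I" and "approach II" are, so the angle-based (cosine) and distance-based (Euclidean) measures shown here are assumptions, as are the term weights in the small document-term matrix.

```python
import math


def cosine_similarity(query, doc):
    """Angle-based similarity: 1.0 means the vectors point the same way."""
    dot = sum(q * d for q, d in zip(query, doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def euclidean_distance(query, doc):
    """Distance-based measure: 0.0 means the vectors coincide."""
    return math.sqrt(sum((q - d) ** 2 for q, d in zip(query, doc)))


# Hypothetical document-term matrix: one row per document, one column per term.
doc_term_matrix = {
    "d1": [1.0, 0.0, 2.0],
    "d2": [0.0, 3.0, 0.0],
    "d3": [2.0, 0.0, 4.0],
}
query = [1.0, 0.0, 2.0]

for name, weights in doc_term_matrix.items():
    print(name,
          round(cosine_similarity(query, weights), 3),
          round(euclidean_distance(query, weights), 3))
```

In this toy data d3 points in the same direction as the query but is twice as long, so the angle measure treats it as a perfect match while the distance measure does not.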
Week 5 Probability Retrieval System/model
Content
Basic concepts of probability, probability theory, statistical independence, Bayes
theorem, representation of documents in the probability model, discrimination function,
probability search, assumptions of the probability model.
Reading:
Chapters 3 and 4. Information Storage and Retrieval by Korfhage
Chapter 11 Introduction to Information Retrieval by Manning
Lecture notes
S. E. Robertson and S. Walker. Some simple effective approximations to the 2–poisson model for
probabilistic weighted retrieval. In Proceedings of ACM SIGIR’94, pages 232–241, 1994.
Crestani, Fabio, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell.
1998. Is this document relevant?. . . probably: A survey of probabilistic models
in information retrieval. ACM Computing Surveys 30(4):528–552.
Fuhr, Norbert. 1992. Probabilistic models in information retrieval. Computer Journal
35(3):243–255.
Week 6 Automatic indexing and abstracting
Content
Indexing, automatic indexing, purpose of indexing, why use automatic indexing,
stop list approach, raw term frequency approach, normalized term frequency approach,
inverse term frequency approach, and other considerations.
Reading:
Chapter 4 Introduction to Information Retrieval by Manning
Zhang J, and Nguyen T (2005). A new term significance weighting approach. Journal of
Intelligent Information Systems, 24(1), 61-85.
Robertson, S. (2004). Understanding inverse document frequency: On theoretical
arguments for IDF. Journal of Documentation, 60(5), pp.503-520.
Salton, G., Allan, J., & Singhal, A.K. (1996). Automatic text decomposition and
structuring. Information Processing and Management, 32(2), pp.127-138.
Cleverdon, Cyril W. 1991. The significance of the Cranfield tests on index languages.
In Proc. SIGIR, pp. 3–12. ACM Press.
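The term-weighting approaches listed for Week 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not the course's own formulation: the stop list is made up, normalization divides by the most frequent term in the document, and the inverse-frequency weight is taken to be the common log(N / df) form of inverse document frequency.

```python
import math
from collections import Counter

STOP_LIST = {"the", "of", "and", "a", "in"}  # hypothetical stop list


def index_terms(text):
    """Stop list approach: drop common words before counting."""
    return [t for t in text.lower().split() if t not in STOP_LIST]


def raw_tf(text):
    """Raw term frequency: number of occurrences of each term."""
    return Counter(index_terms(text))


def normalized_tf(text):
    """Term frequency divided by the count of the most frequent term."""
    counts = raw_tf(text)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


def inverse_document_frequency(docs):
    """log(N / df): terms found in fewer documents get larger weights."""
    n = len(docs)
    df = Counter()
    for text in docs:
        df.update(set(index_terms(text)))
    return {term: math.log(n / count) for term, count in df.items()}


# Hypothetical collection.
docs = [
    "the retrieval of information",
    "information storage and retrieval",
    "storage of the image",
]
print(raw_tf(docs[0]))
print(normalized_tf("retrieval retrieval storage"))
print(inverse_document_frequency(docs))
```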
Week 7 Similarity measure algorithms
Content
Data fusion, term association, general similarity measures, similarity measures in
the vector retrieval model, comparisons of the two kinds of similarity approaches,
extended user profile, current awareness systems, retrospective search systems, reference
point, modifying the query by the user profile.
Reading:
Chapters 4 and 5. Information Storage and Retrieval by Korfhage, and lecture notes
Zhang J, and Rasmussen E (2001). Developing a new similarity measure from two
different perspectives. Information Processing & Management, 37(2), 279-294.
Zhang J, and Rasmussen E (2002). An experimental study on the iso-content-based angle
similarity measure, Information Processing & Management, 38(3), 325-342.
A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity in document retrieval
systems. Journal of the American Society for Information Science, 37:3–11, 1986.
Bartell, Brian T., Garrison W. Cottrell, and Richard K. Belew. 1998. Optimizing
similarity using multi-query relevance feedback. JASIS 49(8):742–761.
Moffat, Alistair, and Justin Zobel. 1998. Exploring the similarity space. SIGIR Forum
32(1).
Week 8 Automatic clustering approaches
Content
Definition of automatic clustering, criteria of clustering, differences between
clustering and classification, significance of a clustering approach in IR, categorization of
clustering algorithms, non-hierarchical clustering algorithm, the K-means clustering
algorithm, K-means in SPSS, hierarchical clustering algorithm, hierarchical clustering in SPSS.
Reading:
Chapters 16 and 17 Introduction to Information Retrieval by Manning
Lecture notes
Rasmussen, E. (1992). Clustering algorithms. In W. B. Frakes and R. Baeza-Yates (Eds.)
Information retrieval: data structures & algorithms (pp.419-442). Englewood
Cliffs, NJ.: Prentice Hall.
Zhao, Y. and Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for
document databases. In Proceedings of the eleventh international conference on
Information and knowledge management, pp.515-524. November 04-09, 2002,
McLean, Virginia, USA.
Hamerly, Greg, and Charles Elkan. 2003. Learning the k in k-means. In Proc. NIPS.
URL: books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf.
Jain, Anil, M. Narasimha Murty, and Patrick Flynn. 1999. Data clustering: A review.
ACM Computing Surveys 31(3):264–323.
Murtagh, Fionn. 1983. A survey of recent advances in hierarchical clustering algorithms.
Computer Journal 26(4):354–359.
Week 9 Theory: Information Visualization
Content
Visualization, visualization for information retrieval, analysis of traditional
information retrieval systems, navigation problems on WWW, why use visualization for
information retrieval, core of visualization for information retrieval, functionality of
visualization, Boolean-based information retrieval system, non-Boolean-based
information retrieval system, visualization of web-based information, consideration from
cognitive engineering, history of visualization, technical environment for the
visualization, potential research topics.
Reading:
Chapter 1 Visualization for Information Retrieval by Zhang
Lecture notes
Card, S.K., Mackinlay, J.D., and Shneiderman, B. (1999). Readings in information
visualization: using vision to think. San Francisco: Morgan Kaufmann, pp. 1-34.
Hearst, M.A. (1999). User interfaces and visualization. In R. Baeza-Yates and B.
Ribeiro-Neto, editors, Modern Information Retrieval, chapter 10, pp. 257--323.
Addison Wesley, Harlow.
Keim, D. A. (2001). Visual exploration of large data sets. Communications of the ACM,
44, 8, pp.38-44.
Week 10 Systems: Information Visualization
Content
Visualization systems, VIBE, DARE, Visual thesaurus, Inxight, Reveal things,
Tilebars, SQWID, JAIR INFORMATION SPACE, WebMap, Excentric Labeling, Tree
map, LifeLines, Web Brain, NiF Elastic Catalog, Dynamic Diagrams, Health InfoPark
Reading:
Chapter 8 Visualization for Information Retrieval by Zhang
Lecture notes
Benford S, Snowdon D, Greenhalgh C, Ingram R, and Knox I (1995). VR-VIBE: A
Virtual Environment for Co-operative Information Retrieval. Proceeding of
Eurographics’95, August 30th-September 1st , 1995, Maastricht, pp.349-360.
Chen C (1999). Visualising semantic spaces and author co-citation networks in digital
libraries. Information Processing and Management, 35(3), 401-420.
Hearst MA (1995). TileBars: visualization of term distribution information in full text
information access. Proceedings of the SIGCHI Conference on Human Factors in
Computing Systems’95, May 7-11, 1995, Denver, Colorado, pp. 59-66.
Korfhage RR, and Olsen KA (1994). The role of visualization in document analysis.
Proceedings of Third Annual Symposium on Document analysis and Information
retrieval’94, April 11-13th, 1994, Las Vegas, Nevada, pp.199-207.
Nuchprayoon, A. and Korfhage, R.R. (1997). GUIDO: visualizing document retrieval.
Proceedings of the IEEE Information Visualization symposium’97, September
23-26, 1997, Isle Capri, Italy, pp.184-188.
Zhang J, and Korfhage RR (1999). DARE: Distance and Angle Retrieval Environment: A
Tale of the Two Measures. Journal of the American Society for Information
Science, 50(9), 779-787.
Week 11 Internet Information Retrieval
Content
Challenge in the Web, language distribution, centralized architecture, crawlers,
jargons, crawling the Web, breadth first approach, depth first approach, crawling
approach, web page ranking, meta-search, considerations for meta-search engines, trends
Reading: lecture notes
Chapters 9 and 10 Introduction to Information Retrieval by Manning
Brin, Sergey, and Lawrence Page. 1998. The anatomy of a large-scale hypertextual
web search engine. In Proc. WWW, pp. 107–117.
Broder, Andrei. 2002. A taxonomy of web search. SIGIR Forum 36(2):3–10.
Gerrand, Peter. 2007. Estimating linguistic diversity on the internet: A taxonomy
to avoid pitfalls and paradoxes. Journal of Computer-Mediated Communication 12(4).
URL: jcmc.indiana.edu/vol12/issue4/gerrand.html.
Glover, Eric J., Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock,
and Gary W. Flake. 2002. Using web structure for classifying and describing
web pages. In Proc. WWW, pp. 562–569. ACM Press.
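To make the difference between the breadth first and depth first crawling approaches concrete, here is a small sketch over an in-memory link graph. The graph LINKS and the function crawl are hypothetical. A real crawler would fetch pages over the network and respect politeness rules, which this example leaves out.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["a", "b"],
    "a": ["c", "d"],
    "b": ["e"],
    "c": [],
    "d": [],
    "e": ["home"],
}


def crawl(seed, links, breadth_first=True):
    """Return the order in which pages are visited, starting from seed.

    Breadth first uses the frontier as a queue; depth first uses it as a
    stack. Pages are marked as seen when queued, so none is visited twice.
    """
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        out_links = links.get(page, [])
        # For the stack, push in reverse so the first link is followed first.
        candidates = out_links if breadth_first else reversed(out_links)
        for target in candidates:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order


print(crawl("home", LINKS, breadth_first=True))
print(crawl("home", LINKS, breadth_first=False))
```

Breadth first visits every page one link away from the seed before going deeper, while depth first follows one chain of links to its end before backing up.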
Week 12 Image retrieval
Content
Content-based image retrieval, image feature description, color, color histogram,
color order system, texture, shape, characteristics of image queries, image system
applications, image retrieval systems
Reading:
Lecture notes
Week 13 Evaluation issues
Content
Seven criteria for evaluation for information retrieval, Average recall and average
precision, Harmonic mean, evaluation of a search engine, relevance issue, Kappa
measure, quality versus quantity, Possible factors which influence outcome of a search,
Cranfield experimental study
Reading:
Chapter 8. Information Storage and Retrieval by Korfhage
Chapter 8 Introduction to Information Retrieval by Manning
Lecture notes
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking
on the notion in information science. Part II: nature and manifestations of relevance.
Journal of the American Society for Information Science and Technology, 58(3), 1915-1933.
Saracevic, T. (2007). Relevance: A review of the literature and a framework for thinking
on the notion in information science. Part III: Behavior and effects of relevance. Journal
of the American Society for Information Science and Technology, 58(13), 2126-2144.
Harter, Stephen P. 1998. Variations in relevance assessments and the measurement of
retrieval effectiveness. JASIS 47:37–49.
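The evaluation measures named for Week 13 can be sketched as below. The document ids and relevance judgments are invented for illustration, and the kappa shown is Cohen's kappa for two judges with binary labels. Other kappa variants estimate chance agreement differently, so treat the formula as one possible choice.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall (often called F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(judge_a, judge_b):
    """Agreement between two judges' 0/1 relevance labels, corrected for chance."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    p_a = sum(judge_a) / n  # share of documents judge A marked relevant
    p_b = sum(judge_b) / n  # share of documents judge B marked relevant
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical search result and relevance judgments.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(p, r, harmonic_mean(p, r))
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```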
Week 14 Student presentation
Note: * If you are a student with special needs, please feel free to discuss them with the
instructor.
* The schedule may be changed
Summary:
Week1, Sept 3      Introduction
Week2, Sept 10     Data control and data presentation
Week3, Sept 24     Boolean retrieval system/model
Week4, Oct 1       Vector retrieval system/model
Week5, Oct 8       Probability retrieval system/model
Week6, Oct 15      Automatic indexing and abstracting
Week7, Oct 22      Similarity measure approaches
Week8, Oct 29      Automatic clustering approaches
Week9, Nov 5       Theory: information visualization
Week10, Nov 12     Systems: information visualization
Week11, Nov 19     Internet information retrieval
Week12, Nov 26     Image retrieval
Week12, Nov 26     No class (Thanksgiving recess)
Week14, Dec 3      Evaluation issues
Week15, Dec 10     Project presentation
Term paper topic list
Students will develop a 15-20 page paper on one of the topics listed below. Papers will
characterize current issues associated with the topic, discuss the state of the art of the
topic, evaluate sample systems, and outline future directions for the area. Papers must
integrate a minimum of 15 relevant sources. Papers should use the American
Psychological Association (APA) style (http://apastyle.apa.org/)
[1]. Music information retrieval
[2]. Image information organization and retrieval
[3]. Automatic indexing theory and practice
[4]. Evaluation of search engines
[5]. On Boolean-based information retrieval system
[6]. Comparison between Boolean-based and Vector-based information systems
[7]. Visualization of information: theoretical aspect
[8]. Visualization of information: system aspect
[9]. Automatic indexing/abstracting theory and practice
[10]. Other information retrieval models
[11]. Evaluation of an information visualization system