Introduction to the course and to Information Retrieval

LIS618 lecture 0
Thomas Krichel
today's lecture
• A look at the course home page
• administrative stuff
• historical matters about the course
• about me
• business of database searching
• indexes
• the Boolean information retrieval model
• practice example on Dialog
Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture
– Factually oriented, around 15 minutes
– Remove worst performance
– Average to form 50%
• Search exercise 50%
• I may make some adjustment to the
syllabus this week.
Search exercise
• Find victim of an information need
• Best to take someone you know in a
professional capacity
• Conduct interview about an information
need experienced by the victim, write
down expectations
• Search in formal database and on web
• Discuss results with the victim
• Write essay, no longer than 5 pages.
about the course
• This course is new wine in an old bottle
• Officially a merger of
– lis566 information resources on the Internet
• mailing lists
• usenet news
• web searching
– lis618 database searching
• access and use of commercial databases
mix of theory and practice
• I am not a database search practitioner.
• Each database is different, practical skills
are not easily transferable.
• Thus my emphasis in the course is more
on theory.
• In the past, I did theory first, then practice.
• These days I mix. Some theory and some
practice in every session.
What online retrieval systems?
• Dialog has been the traditional database
– They were the market leaders in online
databases in the past.
– Nowadays the field is much more open.
– They remain a very good teaching tool for
command based database searching.
• Nexis: a news database I have covered
every year.
• Google: a well-known search engine that I
started to cover two years ago.
other stuff
• Other online IR systems that I have covered in
the past
– OCLC FirstSearch
– Factiva (briefly)
– WestLaw (external speaker)
• New developments
– Peer-to-peer networks
– an introduction to reference linking using OpenURL
• Old developments with library potential
– relational databases
About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at
the Universities of Toulouse, Paris, Exeter
and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of
Surrey, 1993 to 2001
• Since 2001 assistant professor at the
Palmer School
• During research assistantship period,
(1990 to 1993) I was constantly frustrated
with difficult access to scientific literature.
• At the same time, I discovered easy
access to freely downloadable software
over the Internet.
• I decided to work towards downloadable
scientific documents. This led to my
library career (eventually).
Steps taken I
• 1993 founded the NetEc project at, later available at as well as at
• These are networking projects targeted to
the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later
Steps taken II
• Set up RePEc, a digital library for
economics research. Catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the
research process
• Decentralized collection, model for the
open archives initiative
Steps taken III
• Co-founder of Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for
(Research in Computing, Library and
Information Science)
• Currently working on the Konz project. It
uses a database of titles of papers
published in journals and tries to find them
on the Internet.
my interest in databases
• an important emphasis of the course is still on
commercial databases.
• From my point of view I have two interests
in database searching
– As a provider, I must understand how people
search in order to provide some data that they
can use and will use.
– As an economist, I have a strong interest in
information as a commodity. The database
market is an important market place.
online information retrieval
• This subject can be thought of as a subset
of information retrieval (IR). Most IR is
online or digital.
• IR concentrates on textual data.
• We can think of online IR as falling under two headings
– database IR
– web IR
database / web IR
• Database IR looks at systems that have
– a controlled set of records
– low heterogeneity
– use requires authentication
– advanced search features
• Web IR has opposite characteristics
traditional social model
• User goes to a library
• Describes problem to the librarian
• Librarian does the search
– without the user present
– with the user present
• Hands over the result to the user
• User fetches full-text or asks a librarian to
fetch the full text.
economic rationale for traditional model
• In olden days the cost of
telecommunication was high.
• Database use costs
– cost of communication
– cost of access time to the database
• The traditional model controls an upper
limit to the costs.
• With the cost of access time gone, the traditional
model is under threat
• There is disintermediation where the
librarian loses her role of doing the search
• But that may not be good news for
information retrieval results
– user knows subject matter best
– librarian knows searching best
Web searching
• IR has received a lot of impetus through
the web, which poses unprecedented
search challenges.
• With more and more data appearing on
the web, database searching (DS) may be a subject in decline
– It is primarily concerned with non-web
– There are more and more web-based methods
of searching
Public access vs quality
• Now the public at large is able to do online
• At the same time need for quality answers has
• Quality-filtered services will become more
• In the current databases, there is a lot that
would already be available for free mixed with
quality-controlled stuff.
• Publishers have direct offerings and
intermediated vending is in decline.
main theory part
• Literature:
– "Modern Information Retrieval" by Ricardo
Baeza-Yates and Berthier Ribeiro-Neto.
– "Information Retrieval in the Digital Age" by
Heting Chu.
• You don't need to buy the books. You had
better spend time practicing on databases
rather than reading books.
components of the IR process
• provider
– define data that is available
• documents that can be used
• document operations
• document structure
– index
• user
– user need
– IR system familiarity
the IR process
• User need is expressed in a query
• Processing of query yields retrieved documents
• Calculation of relevance ranking
• Examination of retrieved documents
• Possible return to the start, another query.
main problem
• User is not an expert at the formulation of
a query
• Garbage in garbage out, the retrieval
yields poor results
• Ways around that problem
– design very intuitive interface for the query
– give expert guidance
taxonomy of classic IR models
• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean
• Vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model
• Probabilistic
– inference network
– belief network
• There are three basic types of models in
classic information retrieval.
• Extensions of these types are a matter of
research concern and require good
mathematical skills.
• All classic models treat documents as
individual pieces.
key aid: index
• An index is a list of terms, with a list of locations
where the term is to be found.
• The way to express locations usually depends
on the form that the indexed data takes.
– for a book, it is usually the page number, e.g.
"shmoo 34, 75"
– for computer files it is usually the name of the file plus
the number of the byte where the indexed term starts,
e.g. "krichel index.html 34, cv.html 890 1209"
• There is usually more than one location of the term.
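A minimal sketch of such an index for computer files, assuming the files are small enough to hold in memory. The function name build_index and the file contents are invented for this illustration. Locations are recorded as in the example above: the name of the file plus the number of the byte where the term starts.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each term to {file name: [byte offsets where the term starts]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        data = text.encode("utf-8")
        # Treat every run of ASCII letters as one term (a simplifying assumption).
        for match in re.finditer(rb"[A-Za-z]+", data):
            term = match.group().decode("ascii").lower()
            index[term][name].append(match.start())
    return index

# Hypothetical file contents, made up for the example.
files = {
    "index.html": "Home page of Thomas Krichel",
    "cv.html": "Krichel, Thomas. CV of T. Krichel",
}
index = build_index(files)
for name, offsets in index["krichel"].items():
    print("krichel", name, offsets)
```

Looking up a term is then a single dictionary access, which is why searching an index is so much faster than scanning every file.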
key aid: index terms
• The index term is a part of the document that
has a meaning on its own.
• It is usually a noun.
• Retrieval based on index term raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms
• One way out is to specify several terms and
require that they have to be close to each other.
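The proximity idea can be sketched as follows, assuming that locations are counted in words rather than bytes. The helper names positions and near and the parameter max_gap are invented for this illustration and do not come from any particular retrieval system.

```python
def positions(text):
    """Map each lower-cased word to the list of word positions where it occurs."""
    index = {}
    for pos, word in enumerate(text.lower().split()):
        index.setdefault(word.strip(".,;:!?\"'()"), []).append(pos)
    return index

def near(text, term_a, term_b, max_gap=1):
    """True if the two terms occur within max_gap words of each other, in either order."""
    index = positions(text)
    return any(
        abs(a - b) <= max_gap
        for a in index.get(term_a, [])
        for b in index.get(term_b, [])
    )

# Two made-up documents that both contain "current" and "awareness".
doc_1 = "A current awareness service for digital libraries."
doc_2 = "The current state of public awareness is low."

print(near(doc_1, "current", "awareness"))
print(near(doc_2, "current", "awareness"))
```

Both documents contain both terms, but only the first has them next to each other, so the proximity requirement recovers some of the meaning that is lost when terms are matched one by one.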
basic concept: weight of index term
• Given all nouns, not all appear to have the same
relevance to the text
• Sometimes, we can have a simple measure of
the importance of a term, example?
• More generally, for each indexing term and each
document we can associate a weight with the
term and the document.
• Usually, if the document does not contain the
term, its weight is zero
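One simple weight, taken here purely as an assumption for illustration, is the number of times the term occurs in the document. The sketch below builds the table of weights for every pair of term and document. The name weight_table and the sample documents are invented.

```python
from collections import Counter

def weight_table(documents, terms):
    """Return {document name: {term: weight}}, with weight = number of occurrences."""
    table = {}
    for name, text in documents.items():
        counts = Counter(text.lower().split())
        # A Counter returns 0 for a missing word, so absent terms get weight zero.
        table[name] = {term: counts[term] for term in terms}
    return table

documents = {
    "d1": "index terms and index weights",
    "d2": "boolean retrieval model",
}
weights = weight_table(documents, ["index", "model"])
for name, row in weights.items():
    print(name, row)
```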
Boolean model
• In the Boolean model, the weight of
an index term for any document is 1 if the
term appears in the document. It is 0 otherwise.
• This allows query terms to be combined with
the Boolean operators AND, OR, and NOT
• thus powerful queries can be written
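A minimal sketch of the Boolean model, assuming three made-up documents and whole-word matching. With weights of 0 or 1, each term defines the set of documents in which it has weight 1, and the operators AND, OR, and NOT become set intersection, union, and difference.

```python
# Hypothetical documents, invented for this illustration.
documents = {
    "d1": "current awareness services in digital libraries",
    "d2": "digital library software",
    "d3": "current affairs",
}

def weight(term, name):
    """Boolean model weight: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in documents[name].split() else 0

def matching(term):
    """Set of documents in which the term has weight 1."""
    return {name for name in documents if weight(term, name) == 1}

all_docs = set(documents)

both = matching("current") & matching("digital")                   # current AND digital
either = matching("current") | matching("library")                 # current OR library
negated = matching("current") & (all_docs - matching("digital"))   # current AND NOT digital

print(sorted(both), sorted(either), sorted(negated))
```

Note that a document either matches or it does not. The Boolean model has no notion of one match being better than another.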
Classic implementation: dialog
• The documents that I have used
• I am also told that there are others at
Dialog is a databank
• over 500 databases
• these are also known as files and cover
– references and abstracts for published
– business information and financial data;
– complete text of articles and news stories;
– statistical tables
– directories
• DIALOG uses the Boolean model
DIALOG interface
• It is still rooted in "traditional" database
• It has been dismissed as "dial-a-dog".
• It uses a command-driven interface.
• It is very complicated to learn fully.
• It is not suitable for the end-user.
• It therefore offers a valuable skill to the
information professional.
Accessing DIALOG
On the web, go to
Enter username and password
Forget about subaccount
Then click on logon
On the next screen go to command search
"continue" at the next screen
two steps in DIALOG
• Step one: select databases (aka files) to
look at
• Step two: perform searches on the
selected databases
• You may wonder why one does not have
one single step like in a search engine.
sample search
• We want to know something about "current
awareness in digital libraries"
• From dialogweb command search:
– databases
– social sciences and humanities
– library and information science
• This leads you to
This is database selection…
• At that screen you see a number of "files"
with their number.
• You can select those that you want to
• Then you click "begin database"
• and you get back to the command search
• "b numbers" it will say. That is the
command to begin working with files.
Boolean search
• Do a number of searches
– s current(N)awareness
– s digital(N)library
– s digital(N)libraries
• Each search retrieves a set of documents
• The sets can be combined
– s s1 and (s2 or s3)
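The way numbered sets are built up and then combined can be imitated with a toy sketch. This is not Dialog's implementation. The class Session, its methods, and the four records are invented, and the sketch assumes that (N) means the two words are neighbours in either order.

```python
class Session:
    """Toy stand-in for a search session that numbers its result sets."""

    def __init__(self, records):
        self.records = records   # {record id: text}
        self.sets = []           # sets[0] plays the role of S1, sets[1] of S2, ...

    def _store(self, result):
        self.sets.append(result)
        return len(self.sets)    # the set number, starting at 1

    def select_adjacent(self, first, second):
        """Store the records where the two words are neighbours, in either order."""
        hits = set()
        for rec_id, text in self.records.items():
            words = text.lower().split()
            pairs = set(zip(words, words[1:]))
            if (first, second) in pairs or (second, first) in pairs:
                hits.add(rec_id)
        return self._store(hits)

    def combine(self, build):
        """Store a new set computed from the earlier ones."""
        return self._store(build(self.sets))

# Hypothetical records, made up for the example.
records = {
    1: "current awareness for the digital library",
    2: "current awareness in physics",
    3: "building digital libraries",
    4: "awareness of current digital libraries",
}
session = Session(records)
s1 = session.select_adjacent("current", "awareness")
s2 = session.select_adjacent("digital", "library")
s3 = session.select_adjacent("digital", "libraries")
# Counterpart of "s s1 and (s2 or s3)": list positions 0, 1, 2 hold S1, S2, S3.
s4 = session.combine(lambda s: s[0] & (s[1] | s[2]))

for number, result in enumerate(session.sets, start=1):
    print(f"S{number}", sorted(result))
```

Keeping every result set around under a number is what makes it cheap to refine a search step by step instead of retyping one long query.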
What is the deal?
• There are two stages.
• At stage two we make Boolean queries.
• Each query splits the records into
matching and non-matching records.
• The set of matching records is returned.
• It can be further searched or combined
with other sets using Boolean operators.
• Try this at home.
