Introduction to the course and to Information Retrieval

LIS618 lecture 0
Thomas Krichel
2004-01-25
today's lecture
• A look at the course home page
http://wotan.liu.edu/home/krichel/lis618p04s
• administrative stuff
• historical matters about the course
• about me
• business of database searching
• indexes
• the Boolean information retrieval model
• practice example on Dialog
Organization
• homepage
http://wotan.liu.edu/home/krichel/lis618p04s
• Contents to be discussed today.
• Send mail to krichel@openlib.org
– Your name
– Your secret word for grades delivery
• Interrupt me with as many questions as
possible!
• Ask for breaks!
Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture
– Factually oriented, around 15 minutes
– Remove worst performance
– Average to form 50%
• Search exercise 50%
• I may make some adjustment to the
syllabus this week.
Search exercise
• Find victim of an information need
• Best to take someone you know in a
professional capacity
• Conduct interview about an information
need experienced by the victim, write
down expectations
• Search in formal database and on web
• Discuss results with the victim
• Write essay, no longer than 5 pages.
about the course
• This course is new wine in an old bottle
• Officially a merger of
– lis566 information resources on the Internet
• mailing lists
• usenet news
• web searching
– lis618 database searching
• access and use of commercial databases
mix of theory and practice
• I am not a database search practitioner.
• Each database is different, practical skills
are not easily transferable.
• Thus my emphasis in the course is more
on theory.
• In the past, I did theory first, then practice.
• These days I mix: some theory and some
practice in every session.
What online retrieval systems?
• Dialog has been the traditional database
covered.
– They were the market leaders in online
databases in the past.
– Nowadays the field is much more open.
– They remain a very good teaching tool for
command based database searching.
• Nexis: a news database I have covered
every year.
• Google: a well-known search engine that I
started to cover two years ago.
other stuff
• Other online IR systems that I have covered in
the past
– OCLC FirstSearch
– Factiva (briefly)
– WestLaw (external speaker)
• New developments
– Peer-to-peer networks
– an introduction to reference linking using OpenURL
• Old developments with library potential
– relational databases
About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at
the Universities of Toulouse, Paris, Exeter
and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of
Surrey between 1993 and 2001
• Since 2001 assistant professor at the
Palmer School
Why?
• During research assistantship period,
(1990 to 1993) I was constantly frustrated
with difficult access to scientific literature.
• At the same time, I discovered easy
access to freely downloadable software
over the Internet.
• I decided to work towards downloadable
scientific documents. This led to my
library career (eventually).
Steps taken I
• 1993 founded the NetEc project at
http://netec.mcc.ac.uk, later available at
http://netec.ier.hit-u.ac.jp as well as at
http://netec.wustl.edu.
• These are networking projects targeted to
the economics community. The bulk is
– Information about working papers
– Downloadable working papers
– Journal articles were added later
Steps taken II
• Set up RePEc, a digital library for
economics research. Catalogs
– Research documents
– Collections of research documents
– Researchers themselves
– Organizations that are important to the
research process
• Decentralized collection, model for the
open archives initiative
Steps taken III
• Co-founder of Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for
Research in Computing, Library and
Information Science
• Currently working on the Konz project. It
uses a database of titles of papers
published in journals and tries to find them on
the Internet.
my interest in databases
• An important emphasis of the course is still on
commercial databases.
• From my point of view I have two interests
in database searching
– As a provider, I must understand how people
search in order to provide some data that they
can use and will use.
– As an economist, I have a strong interest in
information as a commodity. The database
market is an important market place.
online information retrieval
• This subject can be thought of as a subset
of information retrieval (IR). Most IR is
online or digital.
• IR concentrates on textual data.
• We can think of online IR as falling into two
categories
– database IR
– web IR
database / web IR
• Database IR looks at systems that have
– a controlled set of records
– low heterogeneity
– use requires authentication
– advanced search features
• Web IR has opposite characteristics
traditional social model
• User goes to a library
• Describes problem to the librarian
• Librarian does the search
– without the user present
– with the user present
• Hands over the result to the user
• User fetches full-text or asks a librarian to
fetch the full text.
economic rationale for traditional
model
• In olden days the cost of
telecommunication was high.
• Database use costs
– cost of communication
– cost of access time to the database
• The traditional model puts an upper
limit on the costs.
disintermediation
• With the cost of access time gone, the traditional
model is under threat
• There is disintermediation where the
librarian loses her role of doing the
search.
• But that may not be good news for
information retrieval results
– user knows subject matter best
– librarian knows searching best
Web searching
• IR has received a lot of impetus through
the web, which poses unprecedented
search challenges.
• With more and more data appearing on
the web, database searching may be a subject in decline
– It is primarily concerned with non-web
databases
– There are more and more web-based methods
of searching
Public access vs quality
• Now the public at large is able to do online
searching.
• At the same time, the need for quality answers has
grown.
• Quality-filtered services will become more
important.
• In the current databases, there is a lot that
would already be available for free mixed with
quality-controlled stuff.
• Publishers have direct offerings and
intermediated vending is in decline.
main theory part
• Literature:
– "Modern Information Retrieval" by Ricardo
Baeza-Yates and Berthier Ribeiro-Neto.
– "Information Retrieval in the Digital Age" by
Heting Chu.
• You don't need to buy the books. You had
better spend time practicing on databases
rather than reading books.
components of the IR process
• provider
– define data that is available
• documents that can be used
• document operations
• document structure
– index
• user
– user need
– IR system familiarity
the IR process
• Query expresses user need in a query
language
• Processing of query yields retrieved
documents
• Calculation of relevance ranking
• Examination of retrieved documents
• Possible return to the start, another query.
main problem
• User is not an expert at the formulation of
a query
• Garbage in, garbage out: the retrieval
yields poor results
• Ways around that problem
– design very intuitive interface for the query
– give expert guidance
taxonomy of classic IR models
• Boolean, or set-theoretic
– fuzzy set models
– extended Boolean
• Vector, or algebraic
– generalized vector model
– latent semantic indexing
– neural network model
• Probabilistic
– inference network
– belief network
summary
• There are three basic types of models in
classic information retrieval.
• Extensions of these types are a matter of
research concern and require good
mathematical skills.
• All classic models treat documents as
individual pieces.
key aid: index
• An index is a list of terms, with a list of locations
where the term is to be found.
• The way to express locations usually depends
on the form that the indexed data takes.
– for a book, it is usually the page number, e.g.
"shmoo 34, 75"
– for computer files it is usually the name of the file plus
the number of the byte where the indexed term starts,
e.g. "krichel index.html 34, cv.html 890 1209"
• There is usually more than one location of the
term.
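The idea of an index can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names and texts are made up, the function name build_index is hypothetical, and locations are character offsets, which coincide with byte offsets only for plain ASCII text.

```python
import re
from collections import defaultdict

def build_index(files):
    """Map each term to {file name: [offsets where the term starts]}."""
    index = defaultdict(dict)
    for name, text in files.items():
        # Every run of word characters counts as a term in this sketch.
        for match in re.finditer(r"\w+", text.lower()):
            index[match.group()].setdefault(name, []).append(match.start())
    return index

# Hypothetical file contents, for illustration only.
files = {
    "index.html": "Home page of Thomas Krichel",
    "cv.html": "Krichel CV: Krichel studied economics",
}
index = build_index(files)
print(index["krichel"])  # {'index.html': [20], 'cv.html': [0, 12]}
```

Looking up a term is then a single dictionary access, which is what makes an index so much faster than scanning every file for every query.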
key aid: index terms
• The index term is a part of the document that
has a meaning on its own.
• It is usually a noun.
• Retrieval based on index terms raises questions
– semantics in query or document is lost
– matching done in imprecise space of index terms
• One way out is to specify several terms and
require that they have to be close to each other.
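A proximity requirement of this kind can be sketched as follows. The sketch assumes that word positions are recorded for each document; the function names positions and near, the max_gap parameter, and the sample documents are all hypothetical, and nothing here describes how any real system implements its proximity operators.

```python
def positions(text):
    """Return {term: [word positions]} for one document."""
    result = {}
    for pos, word in enumerate(text.lower().split()):
        result.setdefault(word, []).append(pos)
    return result

def near(text, term_a, term_b, max_gap=1):
    """True if the two terms occur within max_gap words of each other, in either order."""
    pos = positions(text)
    return any(
        abs(a - b) <= max_gap
        for a in pos.get(term_a, [])
        for b in pos.get(term_b, [])
    )

# Made-up documents, for illustration only.
docs = {
    1: "the digital library offers current awareness services",
    2: "a library with a large digital collection",
}
matches = [doc_id for doc_id, text in docs.items() if near(text, "digital", "library")]
print(matches)  # [1]
```

Both documents contain both terms, but only the first has them next to each other, so the proximity requirement recovers some of the meaning that a plain term match loses.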
basic concept: weight of index term
• Not all nouns in a text appear to have the same
relevance to the text
• Sometimes, we can have a simple measure of
the importance of a term, example?
• More generally, for each indexing term and each
document we can associate a weight with the
term and the document.
• Usually, if the document does not contain the
term, its weight is zero
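One simple measure, offered here only as an illustration since the slide leaves the example open, is the number of times a term occurs in the document. The function name term_weights and the sample text are assumptions made for this sketch.

```python
from collections import Counter

def term_weights(text):
    """Weight of each term = number of times it occurs in the document."""
    return Counter(text.lower().split())

# Made-up document, for illustration only.
doc = "digital library research and digital library practice"
weights = term_weights(doc)
print(weights["digital"])  # 2
print(weights["shmoo"])    # 0, because a Counter returns 0 for absent terms
```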
Boolean model
• In the Boolean model, the weight of an
index term for a document is 1 if the
term appears in the document. It is 0
otherwise.
• This allows query terms to be combined with
the Boolean operators AND, OR, and NOT
• thus powerful queries can be written
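A minimal sketch of the Boolean model, assuming three made-up documents and one hard-coded query. The names weight and matches are hypothetical; a real system would look terms up in an index rather than scan the text.

```python
# Made-up documents, for illustration only.
docs = {
    "d1": "current awareness in digital libraries",
    "d2": "digital library software",
    "d3": "current affairs",
}

def weight(term, doc_id):
    """Boolean weight: 1 if the term appears in the document, 0 otherwise."""
    return 1 if term in docs[doc_id].split() else 0

def matches(doc_id):
    """Evaluate the query: current AND (digital OR library) AND NOT software."""
    def has(term):
        return weight(term, doc_id) == 1
    return has("current") and (has("digital") or has("library")) and not has("software")

result = [doc_id for doc_id in docs if matches(doc_id)]
print(result)  # ['d1']
```

Because every weight is either 0 or 1, a document either matches the query or it does not; the model has no notion of a partial match or of ranking.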
Classic implementation: dialog
• The documents that I have used
– http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
– http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
– http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
– http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf
• I am also told that there are others at
http://gep.dialog.com/instruction/
Dialog is a databank
• over 500 databases
• these are also known as files and cover
– references and abstracts for published
literature,
– business information and financial data;
– complete text of articles and news stories;
– statistical tables
– directories
• DIALOG uses the Boolean model
DIALOG interface
• It is still rooted in "traditional" database
systems.
• It has been dismissed as "dial-a-dog".
• It uses a command-driven interface.
• It is very complicated to learn fully.
• It is not suitable for the end-user.
• It therefore offers a valuable skill to the
information professional.
Accessing DIALOG
• On the web, go to
http://www.dialogweb.com/
• Enter username and password
• Forget about subaccount
• Then click on logon
• On the next screen go to command search
• "continue" at the next screen
two steps in DIALOG
• Step one: select databases (aka files) to
look at
• Step two: perform searches on the
selected databases
• You may wonder why one does not have
one single step like in a search engine.
Discuss.
sample search
• We want to know something about "current
awareness in digital libraries"
• From dialogweb command search:
– databases
– social sciences and humanities
– library and information science
• This leads you to
http://www.dialogweb.com/cgi/logoff?mode=guided&url=/cgi/dwframe?href=search.html
This is database selection…
• At that screen you see a number of "files"
with their number.
• You can select those that you want to
search
• Then you click "begin database"
• and you get back to the command search
• "b numbers" it will say. That is the
command to begin working with files.
Boolean search
• Do a number of searches
– s current(N)awareness
– s digital(N)library
– s digital(N)libraries
• Each search retrieves a set of documents
• The sets can be combined
– s s1 and (s2 or s3)
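Combining result sets corresponds to ordinary set operations. The sketch below uses made-up record numbers for the three sets; it illustrates the logic only and does not talk to Dialog.

```python
# Hypothetical record numbers returned by the three searches.
s1 = {101, 102, 105}  # current awareness
s2 = {102, 103}       # digital library
s3 = {105, 107}       # digital libraries

# s1 and (s2 or s3): AND is set intersection, OR is set union.
s4 = s1 & (s2 | s3)
print(sorted(s4))  # [102, 105]

# NOT is set difference, e.g. s1 not s2.
s5 = s1 - s2
print(sorted(s5))  # [101, 105]
```

Each combination yields a new set, which can in turn be combined with others in the same way.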
What is the deal?
• There are two stages.
• At stage two we make Boolean queries.
• Each query splits the records into
matching and non-matching records.
• The set of matching records is returned.
• It can be further searched or combined
with other sets using Boolean operators.
• Try this at home.
http://openlib.org/home/krichel
Thank you for your attention!