history of information retrieval

LIS618 lecture 2
history
Thomas Krichel
2011-09-19
contents
• based on a very fine paper by Michael Lesk
“The Seven Ages of Information Retrieval”.
• That paper was written in 1997, so it does not
cover recent advances.
general problem
• The general approach here is a computer
science approach: how to build systems that
will allow people to obtain information.
• This can be distinguished from a more
positivistic approach that observes how
people actually deal with obtaining
information.
general overview
• In the last decades, few industries have been
subjected to more technological change than
the information industry.
• Nevertheless, there are two other factors of
note
– the political and economic environment
– research and theoretical advances
the broad picture
• Throughout the life of information retrieval
there has been a constant struggle over whether
to prepare the system by intellectual or by
statistical methods.
• Intellectual methods require substantial
human input. Despite that, they have not
been completely ruled out.
statistical methods
• When we use statistical methods, we
interrogate a text for tokens of potential
semantic significance, e.g. words. (details in
next lecture).
• We look at how often they occur in
documents. It’s a brute-force job.
• We store a representation of these counts in a
computer (see the sketch below).
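• A minimal Python sketch of such a count (the example document and the crude whitespace tokenization are mine, for illustration only):

from collections import Counter

def term_counts(text):
    # crude tokenization: lowercase and split on whitespace
    return Counter(text.lower().split())

doc = "information retrieval deals with the retrieval of information"
print(term_counts(doc))
# e.g. Counter({'information': 2, 'retrieval': 2, 'deals': 1, ...})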
information analysis method
• In this approach we have humans analyze the
contents of documents to make judgments
about them.
• This analysis predates computers.
• Early computer-based systems had to rely on it
because the computers of the time were not
powerful enough to process the full text.
• In modern systems, such as web information
retrieval, this is making a comeback.
1945 As We May Think
• This is a very famous (but, IMHO, not a good)
paper by Vannevar Bush.
• Bush envisages the memex. It stores
human knowledge in a desktop device.
• It lets users manage commented associations
between its stored items.
• The essay pays scant attention to information
retrieval (bar condemning existing
approaches).
1950s
• The USSR launches Sputnik.
• The US is worried
– need to improve the organization of science
– need to understand what the Russians are doing
• learn Russian
• work on machine translation
• Suddenly it seems like a good idea to go back
to something like a wartime science effort.
major problems
• There were about 100 computers in the
United States.
• Computers operated in batch-processing
rather than interactive mode.
• The supply of machine-readable text was very
small.
special hardware
• There were some experiments with special
hardware
– Edge-notched cards by Calvin Mooers
– WRU Searching Selector of Allen Kent
• These were abandoned as digital
technology progressed.
Hans Peter Luhn (1896—1964)
• In 1957 he was the first to propose extracting
information from text by automated means.
– “It is here proposed that the frequency of word
occurrence in an article furnishes a useful
measurement of word significance. It is further
proposed that the relative position within a
sentence of words having given values of
significance furnish a useful measurement for
determining the significance of sentences. The
significance factor of a sentence will therefore be
based on a combination of these two
measurements.”
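• A simplified Python sketch of this idea (the frequency threshold and the way significant words are clustered are my own simplifications for illustration, not Luhn’s exact 1958 procedure):

from collections import Counter
import re

def luhn_sentence_scores(text, min_freq=2):
    # words that occur at least min_freq times count as "significant"
    words = re.findall(r'[a-z]+', text.lower())
    significant = {w for w, c in Counter(words).items() if c >= min_freq}
    scores = []
    for sentence in re.split(r'[.!?]', text.lower()):
        tokens = re.findall(r'[a-z]+', sentence)
        positions = [i for i, w in enumerate(tokens) if w in significant]
        if not positions:
            continue
        # cluster = span from the first to the last significant word
        span = positions[-1] - positions[0] + 1
        scores.append((len(positions) ** 2 / span, sentence.strip()))
    return sorted(scores, reverse=True)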
KWIC
• Luhn also worked on KWIC (key word in
context) indexing.
• The idea was that an index entry should show not
just an isolated word, but the keyword together with
its surrounding context, sorted on the keyword.
Example (lines keyed on the word in the middle column):
      the idea was that  Information Retrieval  should
         also worked on  KWIC                   (key word in context)
             Hans Peter  Luhn                   was the first
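• A toy Python sketch of KWIC index generation (the stop word list, context width and formatting are arbitrary illustrations):

def kwic(lines, width=3):
    stop = {'the', 'was', 'that', 'in', 'on', 'a', 'of'}
    entries = []
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            if w.lower() in stop:
                continue                      # common words are not keys
            left = ' '.join(words[max(0, i - width):i])
            right = ' '.join(words[i + 1:i + 1 + width])
            entries.append((w.lower(), left, w, right))
    return sorted(entries)                    # sorted on the keyword

for _, left, word, right in kwic(["Hans Peter Luhn also worked on KWIC indexing"]):
    print(f"{left:>20}  {word}  {right}")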
1960
• This is when online IR really started.
• The first systems were being built, with
heavy subsidies.
• Some of today’s scientific IR systems
can be traced to precursors built at that time.
fuller descriptions
• As computers became more powerful, more
and more descriptors could be taken from
texts to make them findable in a retrieval
system.
• Taking more descriptors, however, meant more
work for the people involved in finding the
descriptive terms.
full-text indexing
• During the 60s people started to look at the
situation where all words would be used as
descriptors for the document.
• Such a strategy is
– less work for the human
– more work for the computer
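• A minimal Python sketch of such a full-text (inverted) index, where every word of every document becomes a descriptor (the documents are invented):

from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)        # every word points to its documents
    return index

docs = {1: "wing loads at supersonic speed",
        2: "boundary layer flow at supersonic speed"}
index = build_index(docs)
print(index["supersonic"])                 # {1, 2}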
IR performance criteria
• In order to compare full-text and selective
indexing, criteria had to be found and
experiments to be conducted.
• Cyril Cleverdon (1914—1997) developed
criteria of precision and recall.
• He and his disciples worked on a standard
document set that could be used for testing,
about 14k abstracts on aeronautics.
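• Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that are retrieved. A small Python illustration (the document identifiers are invented):

# precision and recall for one query
retrieved = {1, 2, 3, 4, 5}
relevant  = {2, 4, 6, 8}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)   # 2/5 = 0.4
recall    = len(hits) / len(relevant)    # 2/4 = 0.5
print(precision, recall)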
and they found
• They found that full-text indexing was
superior to human-aided indexing.
• This was very controversial at the time. All
catalogers got very upset!
new retrieval techniques
• Having an experimental dataset that was
shared by researchers was key to developing
experiments that were comparable.
• One idea was “relevance feedback”. The idea
was that the user would select relevant documents
in a first step, and then the terms from these
documents would be added to the query in a
second step.
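• A Python sketch of this two-step idea, in the style of Rocchio-type feedback on the vector model (the weights alpha and beta are arbitrary choices for illustration):

from collections import Counter

def expand_query(query_terms, relevant_docs, alpha=1.0, beta=0.5):
    # start from the original query terms ...
    new_query = Counter({t: alpha for t in query_terms})
    # ... and add weighted terms from documents the user marked relevant
    for doc in relevant_docs:
        for term, count in Counter(doc.lower().split()).items():
            new_query[term] += beta * count / len(relevant_docs)
    return new_query

print(expand_query(["aircraft", "noise"],
                   ["jet engine noise at take off",
                    "noise of supersonic aircraft"]).most_common(5))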
NASA ReCON
• The name stands for “remote console”.
• It is thought to be the first multi-site
bibliographic online information system for
scientific literature.
• It was the first large-scale information system that was
– online
– interactive
– distributed
NASA ReCON
• It dealt with 200,000 citation strings.
• It searched for keywords manually extracted
from text.
• It allowed fielded searches.
• It allowed combinations of fields in a query and
combinations of the result sets received.
evolution from testbeds to services
• ReCON was developed by Lockheed in 1965.
• It became a testbed at Lockheed using the
ERIC database in 1969.
• It became the DIALOG online system in 1972.
• For a long time, that system was the largest
general online retrieval system, mainly for
bibliographic data.
specialized engines
• COLEX was a system developed in 1965 for the
Air Force by Systems Development
Corporation.
• It became SDC’s ORBIT system.
• The National Library of Medicine used that
system to build the MEDLARS system that
would give online access to medical abstracts
as early as 1971.
increase of available text
• The development of computer-aided
typesetting, and later word processing, made
a lot more text available online than
before.
• Among the first were the large abstracting
databases such as Chemical Abstracts.
• Later newspapers followed.
other services
• In 1973 the Lexis system of US court records
became the first large-scale full-text database.
• OCLC got tapes from LoC and printed catalog
cards. Later they started shared cataloging,
because the LoC only cataloged about 2/3 of
what the member libraries had.
research
• Research into IR prototype systems actually
declined in the US.
• NSF shifted research funds away from prototypes
towards the development of actual systems.
• There was some work on AI in IR, but it got
pretty much nowhere.
research on models
• Gerard Salton (1927—1995) and his group at
Cornell refined the vector space model and
introduced tf/idf term weighting in the early
70s. This will be covered later.
• Later in the 70s the probabilistic model of
information retrieval emerged in work by
Keith van Rijsbergen (1943—) and friends. This
will not be covered in the course.
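• As a preview of the weighting covered later, one common tf/idf variant weights a term by how often it occurs in the document and how rare it is across the collection. A minimal sketch (the toy documents are invented):

import math

def tf_idf(term, doc_tokens, all_docs):
    tf = doc_tokens.count(term)                  # raw term frequency in the document
    df = sum(1 for d in all_docs if term in d)   # number of documents containing the term
    return tf * math.log(len(all_docs) / df) if df else 0.0

docs = [["wing", "loads"], ["wing", "flow", "supersonic"], ["boundary", "layer"]]
print(tf_idf("wing", docs[0], docs))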
80s
• This decade saw the introduction of
inexpensive personal computers.
• In libraries
– Card catalogs were replaced by OPACs, allowing
non-specialists access to online information.
– Online information started to use full text.
full-text retrieval
• In the scientific abstract databases full-text is
not that important.
• But in legal databases it is. Lexis was the first
to provide full-text access.
• Later Lexis was followed by the newspapers.
They were among the first non-reference
publishers to adopt computerized typesetting.
These databases contained text only.
research effort
• The vector model was still the standard, but
researchers tried using dictionaries to
disambiguate terms in documents and
queries.
• Some work was done on bilingual retrieval.
• Some work was done on part-of-speech
assignment.
• Research made no impact on real-world
commercial systems.
CD ROM
• In the late 80s the CD-ROM appeared and made
a dent in online information retrieval.
• The CD-ROM fits the standard print publishing
model.
• Sharing CD-ROMs however, proved a
nightmare in libraries.
90s
• The 90s were the decade of the Internet.
• Online information retrieval became common.
• The CD-ROM lost out.
integration of graphics
• The web became compelling as soon as a
graphical user interface was introduced by the
Mosaic browser.
• This brought the subject of multimedia
information retrieval, previously largely confined
to research experiments, out into the open.
start of search engines
• Early search engines looked at the web with
the eyes of earlier information retrieval
engines.
• They essentially applied the vector model to
the text extracted from web pages.
• But web pages also contain pictures and,
more importantly, links.
Stanford digital library project
• Run at Stanford between 1995 and 2004 by a
group of researchers to “provide an
infrastructure that affords interoperability
among heterogeneous, autonomous digital
library services.”
• In 1996 Sergey Brin and Larry Page started
to work on a project to analyze the links
between web pages.
web information retrieval
• I argue that web information retrieval is
different from conventional retrieval because,
on the web, pages have different intrinsic
values.
• The PageRank algorithm is a simple algorithm
to make such values explicit.
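• A minimal power-iteration sketch of PageRank in Python (the toy link graph, iteration count and damping factor are illustrative choices, not the values of any real engine):

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # every page starts with the "random jump" share
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for p, outgoing in links.items():
            if not outgoing:
                continue
            # a page passes its rank on, split evenly over its outgoing links
            share = damping * rank[p] / len(outgoing)
            for q in outgoing:
                new_rank[q] += share
        rank = new_rank
    return rank

links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
print(pagerank(links))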
http://openlib.org/home/krichel
Please shut down the computers when
you are done.
Thank you for your attention!