
Early Milestones in IR Research
ChengXiang Zhai
Department of Computer Science
Graduate School of Library & Information Science
Institute for Genomic Biology
Department of Statistics
University of Illinois, Urbana-Champaign
1
Major Research Milestones
• Early days (late 1950s to 1960s): foundation and founding of the field [Indexing + Search]
  – Luhn's work on automatic encoding [Indexing: auto vs. manual]
  – Cleverdon's Cranfield evaluation methodology and index experiments [Evaluation]
  – Maron, Kuhns, & Rocchio's work on optimization of IR [Models]
  – Salton's early work on SMART system and experiments [System]
• 1970s-1980s: a large number of retrieval models [Theory]
  – Vector space model
  – Probabilistic models
• 1990s: further development of retrieval models and new tasks [Large-scale evaluation, beyond ad hoc retrieval]
  – Language models
  – TREC evaluation
• 2000s-present: more applications, especially Web search, and interactions with other fields [Web search, Machine learning, Scalability]
  – Web search
  – Learning to rank
  – Scalability (e.g., MapReduce)
2
Outline
• Milestone 1: Automatic Indexing (Luhn)
• Milestone 2: Evaluation (Cleverdon)
• Milestone 3: SMART system (Salton)
• Milestone 4: Probabilistic model & feedback
(Maron & Kuhns, Rocchio)
3
Background: library search in 1950s
Index cards are sorted in alphabetical order:
- Title index
- Author index
- Subject index
Users can only search sequentially for items.
Indexing was done manually.
Clear separation of indexing and search.
Card catalogue of Yale University's Sterling Memorial Library (picture from Wikipedia)
4
A typical title card (sorted by title)
(source: www.graves.k12.ky.us/powerpoints/elementary/symrrobertson.ppt )
5
What’s on a card?
(source: www.graves.k12.ky.us/powerpoints/elementary/symrrobertson.ppt )
6
Milestone 1: Automatic Indexing
7
Luhn’s ideas: automatic indexing
• Important contributions of Luhn
  – Automatic indexing (using term frequency to select terms, KWIC)
  – Automatic abstracting (summarization)
  – Measuring similarity of documents based on their indexing terms
  – Selective dissemination of information (SDI, i.e., filtering)
  – Coined the term "business intelligence"
Hans Peter Luhn (IBM)
8
Luhn’s idea: automatic indexing based
on statistical analysis of text
“It is here proposed that the frequency of word occurrence in an
article furnishes a useful measurement of word significance. It is
further proposed that the relative position within a sentence of
words having given values of significance furnish a useful
measurement for determining the significance of sentences. The
significance factor of a sentence will therefore be based on a
combination of these two measurements. ” (Luhn 58)
LUHN, H.P., 'A statistical approach to mechanised encoding and searching of library
information', IBM Journal of Research and Development, 1, 309-317 (1957).
LUHN, H.P., 'The automatic creation of literature abstracts', IBM Journal of Research
and Development, 2, 159-165 (1958).
9
The notion of “resolving power of a
word”
10
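To make the "resolving power" idea concrete, here is a minimal Python sketch of Luhn-style term selection by frequency; the frequency cutoffs are illustrative assumptions, not values from Luhn's papers:

```python
from collections import Counter
import re

def luhn_index_terms(text, min_count=2, max_fraction=0.05):
    """Select candidate index terms by term frequency, in the spirit of Luhn's
    'resolving power': drop very rare words (not significant) and very frequent
    words (too common to discriminate). The cutoffs below are illustrative."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    upper = max_fraction * len(words)
    return {w: c for w, c in counts.items() if min_count <= c <= upper}
```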
Probabilistic view of association and
proximity
“the method to be developed here is a probabilistic one based on
the physical properties of written texts. No consideration is to be
given to the meaning of words or the arguments expressed by word
combinations. Instead it is here argued that, whatever the topic, the
closer certain words are associated, the more specifically an
aspect of the subject is being treated. Therefore, wherever the
greatest number of frequently occurring different words are found in
greatest physical proximity to each other, the probability is very high
that the information being conveyed is most representative of
the article.” (Luhn 58)
11
Automatic abstracting algorithm [Luhn 58]
The idea of query-specific
summarization
“In many instances condensations
of documents are made
emphasizing the relationship of the
information in the document to a
special interest or field of
investigation. In such cases
sentences could be weighted by
assigning a premium value to a
predetermined class of words.”
12
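A simplified Python sketch of Luhn's (1958) sentence-scoring idea: a sentence is scored by its densest cluster of significant words. The gap threshold of four insignificant words follows Luhn's description; the other details are simplified assumptions. The query-specific variant Luhn mentions could be obtained by giving a premium weight to words from a predetermined list before clustering.

```python
def sentence_significance(sentence_words, significant_words, max_gap=4):
    """Score a sentence by a Luhn-style significance factor: find clusters of
    significant words separated by at most `max_gap` insignificant words,
    and score each cluster as (#significant words)^2 / cluster length."""
    positions = [i for i, w in enumerate(sentence_words) if w in significant_words]
    if not positions:
        return 0.0
    best, start = 0.0, 0
    for i in range(1, len(positions) + 1):
        # close the current cluster at the end, or when the gap grows too large
        if i == len(positions) or positions[i] - positions[i - 1] - 1 > max_gap:
            span = positions[i - 1] - positions[start] + 1
            count = i - start
            best = max(best, count * count / span)
            start = i
    return best
```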
Key Word in Context (KWIC)
KWIC is an acronym for Key Word In Context, the most common format for
concordance lines. The term KWIC was coined by Hans Peter Luhn.
[Figure: example KWIC concordance lines, sorted by keyword]
13
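A minimal, purely illustrative sketch of building a KWIC-style concordance in Python (the context width is an arbitrary assumption):

```python
def kwic_index(documents, width=4):
    """Build a tiny Key Word In Context listing: one line per word occurrence,
    with up to `width` words of context on each side, sorted by the keyword."""
    entries = []
    for doc in documents:
        words = doc.split()
        for i, w in enumerate(words):
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            entries.append((w.lower(), left, w, right))
    entries.sort()
    return [f"{left:>30}  [{kw}]  {right}" for _, left, kw, right in entries]
```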
Probabilistic representation and
similarity computation [Luhn 61]
An early idea about using a unigram language model to represent text
What do you think about
the similarity function?
14
Other early ideas related to indexing
• [Joyce & Needham 58]: Relevance-based ranking, vector-space model, query
  expansion, connection between machine translation and IR
• [Doyle 62]: Automatic discovery of term relations/clusters, "semantic road
  map" for both search and browsing (and text mining!)
• [Maron 61]: automatic text categorization
• [Borko 62]: categories can be automatically generated from text using
  factor analysis
• [Edmundson & Wyllys 61]: local-global relative frequency (kind of TF-IDF)
• Many more (e.g., citation index…)
15
Measuring associations of words [Doyle 62]
16
Word Association Map for Browsing
[Doyle 62]
Imagine this can be further
combined with querying
17
Milestone 2:
Cranfield Evaluation Methodology
18
Background
• IR is an empirically defined problem, thus experiments must be designed to
  test whether one system is better than another
• However, early work on IR (e.g., Luhn's) mostly proposed ideas without
  rigorous testing
• Catalysts for experimental IR:
  – Hot debate over different languages for manual indexing
  – Automatic indexing vs. manual indexing
• How can we experimentally test an indexing method?
19
Cleverdon’s Cranfield Tests
1957-1960: Cranfield I
- Comparison of indexing methods
- Controversial results (lots of criticisms)
1960-1966: Cranfield II
- More rigorous evaluation methodology
- Introduced precision & recall
- Decomposed study of each component in an indexing method
- Still lots of criticisms, but laid the foundation for evaluation that has a
  very long-term and broad impact
Cyril Cleverdon (Cranfield Inst. of Tech, UK)
Cleverdon received the ACM SIGIR Salton Award in 1991
URL: http://www.sigir.org/awards/awards.html
20
Cranfield II Test: Experiment Design
• Decomposed study of contributions of different components of an indexing
  language
• Rigorous control of evaluation
  – Having complete judgments is more important than having a large set of
    documents
  – Document collection: 1400 documents (cited papers by 200 authors, no
    original papers by these authors)
  – Queries: 279 questions provided by authors of original papers
  – Relevance judgments:
    • Multiple levels: 1-5
    • Initially done by 6 students in 3 months; final judgments by the
      originators
  – Measures: precision, recall, fallout, precision-recall curve
  – Ranking method: coordination level (# matched terms)
21
Detailed Comparison of Components
22
Measures: Precision, Recall, and
Fallout [Cleverdon 67]
23
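The three measures can be stated compactly; a small Python sketch, assuming the retrieved and relevant documents are given as sets of document ids:

```python
def precision_recall_fallout(retrieved, relevant, collection_size):
    """Precision = relevant retrieved / retrieved;
    recall    = relevant retrieved / all relevant;
    fallout   = non-relevant retrieved / all non-relevant in the collection."""
    rel_ret = len(retrieved & relevant)
    precision = rel_ret / len(retrieved) if retrieved else 0.0
    recall = rel_ret / len(relevant) if relevant else 0.0
    non_relevant = collection_size - len(relevant)
    fallout = (len(retrieved) - rel_ret) / non_relevant if non_relevant else 0.0
    return precision, recall, fallout
```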
Precision-Recall Curve [Cleverdon 67]
24
Cranfield II Test: Results [Cleverdon 67]
For more information about Cranfield II test, see
Cleverdon, C. W., 1967, The Cranfield tests on index language devices. Aslib Proceedings,
19, 173-192.
25
Cranfield II Test: Major Findings
'(1) In the environment of this test, it was shown that the best performance
was obtained by the use of Single Term index languages.
(2) With these Single Term index languages, the formation of groups of
terms or classes beyond the stage of true synonyms or word forms
resulted in a drop of performance.
(3) The use of precision devices such as partitioning and interfixing was
not as effective as the basic precision device of coordination.'
'this test has shown that natural language, with the slight modifications of
confounding synonyms and word forms, combined with simple coordination,
can give a reasonable performance. This means that, based on such
practice, a norm could be established for operational performance in any
subject field, and it would then be for those who proposed new thesauri,
new relational groups, links, or roles, to show how the use of their
techniques would improve on the norm.'
26
Cranfield II Test: Criticisms (Sparck Jones 81)
1. Vickery's remark that the test did not reflect an ordinary operating system
situation (like Mote's earlier) is inappropriate to an explicitly laboratory test.
2. Swanson's and Harter's claims about the existence of many more relevant
documents than were used are themselves open to a good deal of doubt (they
fall into the class of speculative criticisms). On the other hand, their point
about the assessment procedure is more substantial, though there is no
evidence that, while the procedure could have affected the test results, it
actually did so.
Both Cranfield 1 and Cranfield 2 were comparative tests and
it is therefore necessary, in reviewing criticisms of the two experiments, to
distinguish features of the design and conduct of the tests which could
conceivably have affected comparative performance from those which were
most unlikely in fact to have done so. Many of the criticisms of both tests
failed to take this distinction into account. At the same time, the possibility
that hidden factors may affect performance has to be raised in relation to
every test.
The real defects of Cranfield 2 were the lack of statistical tests, noted by
Vickery,
27
Cranfield Test Methodology
• Specify a retrieval task
• Create a collection of sample documents
• Create a set of topics/queries appropriate for the
retrieval task
• Create a set of relevance judgments (i.e.,
judgments about which document is relevant to
which query)
• Define a set of measures
• Apply a method to (or run a system on) the
collection to obtain performance figures
28
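The steps above map directly onto a small evaluation loop; a minimal sketch with hypothetical interfaces (a `system(query, collection)` function returning a ranked list of document ids, and `judgments[qid]` giving the relevant ids), using precision at k as the example measure:

```python
def cranfield_style_evaluation(system, collection, queries, judgments, k=10):
    """Run a retrieval method over a fixed collection and query set, then
    average a single effectiveness measure over queries (here: precision@k)."""
    per_query = []
    for qid, query in queries.items():
        ranked = system(query, collection)[:k]
        relevant = judgments.get(qid, set())
        per_query.append(sum(1 for d in ranked if d in relevant) / k)
    return sum(per_query) / len(per_query) if per_query else 0.0
```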
Milestone 3:
SMART IR System
29
The Cranfield experiments were done manually; what about doing all the
experiments with an automatic system?
SMART System
30
SMART: System for the Mechanical
Analysis and Retrieval of Text
1961-1965: SMART system development (Gerard Salton + Michael Lesk)
- First automatic retrieval system
- Term weighting + vector similarity
- Experimented with many ideas for indexing
- Did statistical significance tests
- Major findings:
  + Weighted terms help
  + Automatic indexing is as good as manual indexing
  + Indexing based on abstracts outperforms titles
  + Linguistic phrases and statistical phrases are similar
Gerard Salton (Harvard, Cornell)
31
About the SMART system
Developed on IBM 7094 (time-sharing system, 0.35 MIPS, 32KB memory)
Early development (1961-1965): Michael Lesk
First UNIX implementation (v8, 1980): Edward Fox
The widely used SMART toolkit (v10/11, 1980-1990s): Chris Buckley
SMART was the most popular IR toolkit (in C), widely used in the 1990s by IR
researchers and some machine learning researchers
32
Features of SMART system
33
SMART Features (cont.)
34
Overall Results
Title only, overlap similarity, and no weights are clearly the worst
35
Key Findings
• Term weighting is very useful (better than binary
values)
• Cosine similarity is better than the overlap
similarity measure
• Using abstracts for indexing is better than using
titles only
• Synonyms are helpful
• Automatic indexing may be as effective as
manual indexing
36
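The contrast between overlap (coordination-level) matching and cosine similarity over weighted term vectors can be sketched as follows; this is an illustration, not the SMART implementation:

```python
import math
from collections import Counter

def overlap_similarity(query_terms, doc_terms):
    """Coordination-level matching: count how many query terms the document shares."""
    return len(set(query_terms) & set(doc_terms))

def cosine_similarity(query_terms, doc_terms):
    """Cosine similarity over raw term-frequency vectors; SMART-style weighting
    (e.g., tf-idf) would replace the raw counts."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0
```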
Milestone 4: Probabilistic model &
feedback
37
Background: Indexing as the central
theme in early days
• Initial questions
  – Is automated indexing feasible?
  – How to do automated indexing?
  – Luhn's answer: choosing indexing terms based on term frequency
• Next questions
  – How to compare different indexing methods?
  – Cleverdon's answer: the "lab test set approach"
• New questions (more theoretical)
  – What's the optimal way of doing indexing?
  – How to model relevance and optimize a retrieval system (beyond indexing)?
    (retrieval models)
38
Two major milestones
• Probabilistic indexing (first probabilistic retrieval
model): Maron & Kuhns (Ramo-Wooldridge)
– Maron, M. E. and Kuhns, J. L. 1960. On Relevance,
Probabilistic Indexing and Information Retrieval. J.
ACM 7, 3 (Jul. 1960), 216-244.
• Relevance feedback (optimization of retrieval):
Rocchio & Salton (Cornell)
– J. J. Rocchio, Jr. and G. Salton, Information Search
Optimization and Iterative Retrieval Techniques,
Proceedings of the AFIPS Fall Joint Computer
Conferences, Las Vegas, Nev., Nov. 1965.
39
Probabilistic Indexing [Maron & Kuhns 60]
• Major contributions:
  – Formalization of "relevance" with a probabilistic model
  – Ranking documents based on "probable relevance of a document to a request"
  – Optimizing retrieval accuracy = statistical inference
• Key insights:
  – Indexing shouldn't be a "go or no-go" binary choice
  – We need to quantify relevance
  – Take a theoretical approach (use a math model)
• Other contributions:
  – Query expansion (leveraging term association)
  – Document expansion (pseudo feedback)
  – IDF
40
Now, think about how to define
an optimal retrieval model using
a probabilistic framework…
41
Probabilistic Indexing: The Original
Model
42
Probabilistic Indexing Model in More
Conventional Notations
Rank documents based on the following probability (early version of
Probability Ranking Principle):
$$P(D_i \mid I_j, A) = \frac{P(D_i \mid A)\, P(I_j \mid A, D_i)}{P(I_j \mid A)} \;\propto\; P(D_i \mid A)\, P(I_j \mid A, D_i)$$

where
- $P(D_i \mid I_j, A)$: probability that a user will be satisfied with document $D_i$ if the user requests information on index term $I_j$
- $P(D_i \mid A)$: prior probability that $D_i$ is relevant
- $P(I_j \mid A, D_i)$: probability that the user would use $I_j$ in the query if the user likes $D_i$ (the "weight in an indexer's mind")
43
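A minimal sketch of ranking with this model, assuming (as in Maron & Kuhns' experiments) that the priors and term probabilities have already been assigned to each document, and combining multiple request terms by a simple product (a convenient assumption for the sketch, not the only combination discussed in the paper):

```python
def probabilistic_indexing_rank(request_terms, docs):
    """Rank documents by P(D|A) * prod_j P(I_j|A,D).
    `docs` is a hypothetical structure:
        {doc_id: {"prior": float, "term_prob": {term: float}}}"""
    def score(d):
        s = d["prior"]
        for t in request_terms:
            s *= d["term_prob"].get(t, 1e-6)  # small floor for terms not indexed
        return s
    return sorted(docs, key=lambda doc_id: score(docs[doc_id]), reverse=True)
```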
However, estimation of
probability is difficult
Experiments were done based on
manual assignment of probabilities
for indexing
44
Proposed Overall Search Strategy
45
Proposed Overall Search Strategy
(cont.)
46
Relevance Feedback [Rocchio 65]
47
Rocchio Feedback Method
nr: number of relevant documents
ns: number of non-relevant documents
48
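The feedback formula itself did not survive the conversion of this slide; for reference, the commonly used (later, parameterized) form of the Rocchio update is shown below, where D_r and D_n are the sets of relevant and non-relevant documents (|D_r| = nr, |D_n| = ns) and alpha, beta, gamma are tuning weights; Rocchio's original 1965 formulation works with the centroids directly:

$$\vec{q}_{\text{new}} = \alpha\,\vec{q} \;+\; \frac{\beta}{n_r}\sum_{\vec{d}\in D_r}\vec{d} \;-\; \frac{\gamma}{n_s}\sum_{\vec{d}\in D_n}\vec{d}$$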
What You Should Know
• Research in IR was very active in 1950s and
1960s!
• Most of the foundational work was done at that time
– Luhn's work proved the feasibility of automatic indexing, and thus of
  automatic retrieval
– Cleverdon’s work set the standard for evaluation
– Salton’s work led to the very first IR system and a
large body of IR algorithms
– Maron & Kuhns proposed the first probabilistic model
for IR
– Rocchio pioneered the use of supervised learning for
IR (relevance feedback)
49