Approaches to Passage Retrieval in Full Text Information Systems

advertisement
Approaches
to Passage
in Full
Information
Text
Gerard
Salton~
J. Allan,
Abstract
Retrieval
Systems
and C. Buckley
mediately
relevant
text passages. Second, the effective-
ness of the retrieval
Large
collections
of full-text
monly
used in automated
the stored document
plete
documents
est.
documents
retrieval.
When
texts are long, the retrieval
of com-
may
information
are now com-
not
because relevant
retrievable
In such circumstances,
efficient
and effective
text
passages responsive
automated
evaluate
encyclopedia
the usefulness
relevant material with
recall and precision.
to particular
user needs.
Most
search system
is used to
ognizable
of the proposed
methods.
text
sentences.
In operational
retrieval
environments,
to process the full text
long, book-sized
of all stored
documents
a mix of different
topics
In these circumstances,
of complete
it is now possible
documents.
are stored,
are naturally
improvements
subdividable
sections,
size covering
in
into rec-
paragraphs,
and
of assembling
text
just
the right
amount
undivided
documents.
Instead
in-
Classical
that are more
user needs than
“ Department
of Computer
NY 14853-7501.
This
Foundation
study
under
Science,
Cornell
was supported
the full docu-
grant
IRI
University,
in part
Methods
commercial
advantage,
of the publication
that
copying
is by permission
To copy
specific
otherwisa,
and
Ithaca,
of tha Association
notice
notica
is
for
and tha
Excerpting
have been introduced
in the past for the use
abstracts,
of important
are then
formed
highly-weighted
by grouping
sentences
into
the individual
text words, and the complete sentence
scores are then based on the occurrence characteristics
a fee
permiaeion.
ACM-SIG IR’93-6/93/Pittsburgh,
PA, USA
@l1993 ACM O-8979 j-60~-O/93/0006
/0049
user
units that are then independently
processed for particular purposes.
Different
methods have been used for the sentence
Typically,
weights are assigned to
scoring process.
ie givan
requiras
Text
sages, or text
for Computing
or to rapublish,
into
to individual
score, or weight, depending on its perceived importance
in the texts under consideration.
Retrievable
text pasa number
copyright
appear,
passages responsive
in response to user
these excerpts
text sentences that are later assembled into retrievable
text units. Typically,
each text sentence is assigned a
89-15847.
the ACM
and its data
text
text excerpts
and assembling
of text passages in information
retrieval and automatic
abstracting.
The standard approaches utilize bottomup procedures based on the identification
of important
by the National
Permission
to copy without
fee all or part of this matwial
grantad provided
that the copias are not made or distributed
titla
relevant
statements,
needs.
ciency of text utilization
may be improved because the
users are no longer faced with large masses of retrieved
materials, but can instead concentrate on the most im-
and/or
As a
the
Several advantages are apparent when individual
text
passages are independently
accessible.
First, the effi-
Machinery.
corresponding
such as text
of varying
retrievable
covered in more or less detail.
it is not useful to maintain
to particular
interest
ment texts.
direct
may not closely re-
specific user queries.
This leads to the notion
for identifying
Many
often containing
text passages should be identified
responsive
Science
subjects
of information
to satisfy the user population.
In the remainder of this study, effective methods are introduced
Introduction
dividual
items
units,
excerpts
integrity
of different
is often higher for the text excerpts than for the corresponding full texts, leading to the retrieval of additional
New approaches are described in this study for implementing selective passage retrieval systems, and identiAn
more easily
result, many potentially
relevant items may be rejected.
When text excerpts are accessible, the query similarity
trieval results may be obtained
by using passage retrieval strategies designed to retrieve text excerpts of
varying size in response to statements of user interest.
fying
are generally
semble narrowly-formulated,
re-
may also be enhanced
texts
than longer ones. The longer items covering
a wide diversity
be in the users’ best inter-
activities
short
of highly-weighted
. ..$J .5Q
49
terms in the respective
sentences.
In
paragraphs
passage retrieval
applications
where the task consists
in retrieving
text excerpts that are similar to available
user queries,
the number
and concentration
of query
words included in the individual
sentences is used to
generate thesentence
score. Increasing concentrationof
query terms, measuredly
the closeness of these terms
in the text sentences, leads to higher assigned sentence
weights. In text abstracting,
placations where user queries
Or text
summarization
text
subject
varying
at varying levels of detail, and responding
kinds of user needs. A top-down
approach
passages of varying
ing increasingly
avail-
tant in representing
text content.
In addition to using the occurrence
user requirements.
be influenced
1. The
of additional
of the sentences
factors
tion
being
titles,
Global-Local
and
under
text sections,
sentences
and sentence
value.
strategies
detected between
3. The use of syntactic relationships
particular
words and sentences-in the text, indicating that the corresponding
text units are related
and ought to be jointly retrieved.
[1-8]
may
by sets, or vectors,
be a word
particular
text,
to favor
terms
stem,
or phrase,
and the term
that
occur
and query texts
of weighted
weights
with
individual
high
frequency
while being relatively
collection
as a whole.
[15] To determine
documents,
sentences
into
meaningful
text
passages
passages. [9-11]
it is difficult
by
using
the
to
produce
low-level
larger
deep semantic
analysis
frames,
ately filled by information
texts.
[12-13]
deep knowledge
ful
for
restricted
techniques,
readable
term-weighting
It is possible
of particular
tasks,
struction
of medical
unrestricted
subject
that
extracted
such
that
stated
subject
as,
for
are appropri-
the
retrieved
and text
have not
term
vectors
are
in decreasing
reflects
order of the
global
texts.
coinci-
The assump-
are therefore
rejected.
The reverse as-
documents
about
in answer to queries about
table
salt might
the SALT
be
treaties.
pends on the use of these words in the language.
[16]
This suggests that linguistic
ambiguities
and multiple
meanings can be eliminated
by checking the local con-
One suggestion that may represent a step in the right
direction is based on the use of text paragraphs, rather
than sentences for the construction
the similarity
or between two
Problems
caused by language
ambiguity
can be
largely solved by making appeal to the “use theory”
of
meaning proposed by Wittgenstein
and others, which
states that the meaning of’ words and expressions de-
con-
summaries of certain types. When
matter must be treated, as is of-
ten the case in practice, the passage retrieval
summarization
methods proposed heretofore
proven equal to the need.
threshold
ity computation,
based on
fields will be useexample,
inside
rare in the
sumption may or may not hold, that is, when the global
similarity
between two vectors is sufficiently
large, the
texts may nevertheless be unrelated because the common vocabulary
may be used in different senses in the
respective texts. Thus, based on a global vector similar-
from the document
approaches
in a
tion is that when two vectors do not exhibit a given minimal global similarity,
the corresponding
texts are not
related. Documents whose query similarity
falls below a
and the use of pre-
or templates,
at the output
dence between query and document
approaches
normally
available for this purpose.
Alternative
strategies have therefore been advocated for
use in text abstracting
and summarization
based on
constructed
the corresponding
computed query similarity.
The global vector similarity
units.
For example, specific rules have been proposed
for generating
so-called connected,
concentrated,
and
compound text
Unfortunately,
A
compared, and coefficients of similarity
are computed
based on the number and the weight of common terms
included in a vector pair. This makes it possible to rank
have also been suggested for assem-
text
included
documents,
stored
are
terms.
are chosen so as
particular
the documents
bling
Analysis
are based on the vector
between a query and a stored document,
procedures
on particular
in Text
model, where document
represented
2. The inclusion in the sentences of special clue words
and clue phrases that are thought
to be impor-
Various
to
is
text paragraphs,
depending
Processing
retrieval
processing
term
of typicality
the
Ret rieval
The Smart
given to
and sec-
headings.
tant in the determination
covering
of
such as:
in the texts
consideration,
special importance
texts representing
figure captions,
full texts,
or sets of adjacent
the sentence scoring system may
by a number
location
characteristics
length,
specific user needs. This makes it possi-
able, the sentence score is similarly based on the number
and concentration
of text words thought to be impor-
terms,
re-
include
ble to retrieve
highly-weighted
purposes,
used whereby large text excerpts are chosen first, that
are successively broken down into smaller pieces cover-
W
are not necessarily
are then chosen for abstracting
placing the originally available texts. [14] In the present
study, the use of complete paragraphs is generalized to
text in which text words and expressions occur.
The
linguistic
context of a term such as “salt” is likely to
of text passages. In
be the same for many different texts dealing with the
SALT treaties.
The same is true for texts discussing
that case, relationships
are computed between individual text paragraphs,
based on the number of common
text components
in the respective paragraphs.
Certain
table salt.
50
On the other hand, the contexts
differ when
Query:
[9562]
Gall
(Parasitic
growth
in plants)
Ret rieved
Document
Query
Number
1.
Unrestricted
Title
Similarity
Simple
Restricted
Sentence
Sentence
Mat ch
Mat ch
output
9567
0.30
GaHe (city
in Sri Lanka)
2.
1494
0.28
Art
3.
9563
0.27
Galla
4.
12134
5.
17675
0.27
0.24
6.
0.24
Turner,
7.
23008
20402
0.24
St. Gallen
8.
11847
0.24
Hymenopteran
9.
10.
17061
22077
0.23
0.23
11.
18414
0.23
Plant
●
●
●
12.
22034
0.23
Tama,risk
●
●
●
●
XN
●
XN
●
●
●
●
●
●
●
●
●
XR
●
XN
●
XN
●
XN
Insect
●
●
Parasite
●
●
●
●
●
XN
●
●
XN
●
●
●
Oak
●
●
●
Tannins
●
●
●
Gallery
(African
people)
J.M.W.
13.
170
0.22
Acorn
14.
9578
0.21
15.
16.
22075
11913
0.21
Gallium
Tannic
17.
18.
(painter)
(city)
(insects)
(cross-reference)
(metal)
Acid
Ichneumon
4953
0.20
0.20
18228
0.20
Phylloxera
Chalcid
Fly
(parasite)
(insect)
●
●
...................................................................... ...........................
19.
7304
0.19
Diseases
20.
23827
0.19
wasD
Table
1:
Effect
of Local
(removal
the query texts are related
Sentence Match
of 7 nonrelevant
to arms control
uments deal with food consumption.
In practice,
a dual text comparison
of Plants
but the doc-
system
Restrictions
for Query
(N) items and 1 relevant
[9562] Gall
(R) item)
The list of retrieved articles shown in Table 1 includes
for the most part relevant items dealing with parasitic
can be
insects that
cause the formation
fortunately,
a number
of galls in plants.
Un-
used, designed to verify both the global similarity
between query and document texts, as well as the coinci-
cluded
on the list of items in Table
1, such as “Galle”,
dence in the respective
a city
in Sri Lanka,
Turner”,
local contexts:
[17,18]
painter.
1. The global text similarity
paring the respective
plained.
larity
2. The
Text
pairs without
are not further
local
for texts
structures
text
with
is computed
sufficient
environments
global
globad simi-
are considered
similarity,
in the texts
The retrieval
next
and local
such as sections,
are retrieved
in answer
number
to the right entitled
substructures,
th{e texts
The effect of the global/local
text comparison system
is illustrated
in the example of Table 1. Here an artiin the Funk
and Wagnalls
a British
by the term
vectors,
so that
terms
such
similarity
to [9562]
Gall
computation.
“simple
by the
simple
The next column
sentence match”
represents
a local retrieval strategy where at least one matching
pair of sentences must be found in query and retrieved
are assumed to be related.
cle included
errors are produced
are in-
as ‘igallery”,
“Galla”,
“gallium”,
and so on, are all reduced to gall, matching the term “gall” in the query.
The left-most column of bullets in Table 1, entitled
“unrestricted
output”,
identifies the top 18 items that
global vector
similar
“J. M.W.
in the query and document
paragraphs
and sentences are compared.
When
globally similar text pairs also contain a sufficient
of locally
and
documents
truncation
used in constructing
the term vectors: normally word stems, rather than complete words, are used
ex-
considered.
sufficient
included
first by com-
term vectors as previously
of extraneous
encyclopedia
document texts above a sentence matching threshold of
75.0 in addition
to a sufficiently
large global similarity. [17,18] As the middle column of bullets shows, six
is
used as a search request, and other related encyclopedia
articles are retrieved in response to the query articles.
[19] Table 1 shows the 20 items exhibiting
the highest
similarity
with the query article “Gall” (article number
of the originally
retrieved items do not fulfill the simple local sentence matching requirement.
This includes
five obviously
nonrelevant
items
(article
[170] Acorn is
[9562]) based on the global vector similarity
between
query and retrieved article texts. Normally
a minimum
cross-reference
not relevant
vant item,
threshold of 0.20 applies for retrieval purposes, and documents with smaller global similarities
are rejected.
because the whole article
article),
[18228] Phylloxera,
tom of the list.
51
to another
that
consists of only a
plus a possibly
appears
rele-
at the bot-
Evaluation
1. Global
Match
Parameters
2. Simple
Text
(un-
Mat
restricted)
Retrieved
all
ch
12,491
1’7,328
queries
Match
Sections
11,296
12,285
after
5 dots
0.5366
0.5820
0.5915
0.6316
after
15 dots
0.7705
0.8027
0.7978
0.8702
. . .. . .. . . .. . .. . .. . .. . .. . .. . . .
Precision
after
Precision
after
.
n-point
5 dots
0.6051
15 dots
.
0.3862
.
Average
.
.
0.7623
+9.170
Precision
Retrieval
.
Evaluation
(partial
relevance
982
Retrieved
all
0.6938
0.4214
0.4466
0.4391
0.8569
+22.576
0.8340
(0.9280)
+19.3%
(+24.7%)
Query
based
Recall
after
.
0.7141
.
after
5 dots
0.7689
0.7867
08067
Precision
after
.
15 dots
0.5022
0.5274
0.5519
0.7520
0.8198
.
Average
0.6920
strategy
Retrieval
relevance
Evaluation
for
assessments
of Global/Local
used to carry
for
Text
a sentence pair has only that
additional
matching
no single term
contributing
local
are
sentence
An evaluation
([23008]
now
match
J.M.W.
rejected
because
of
text
Comparison
and Passage Retrieval
from the owners of the encyclopedia
retrieved
by the simple
of Table 2(a).
sentence
are indeed generally
con-
as nonrelevant
The detailed
shows that items rejected
sin-
match
in the eval-
example
of Table
by the local matching
nonrelevant.
1
process
This is also confirmed
by the detailed recall and precision figures included in
Table 2(a) which both increase sharply between runs 1
and 2.
The precision
figures
included
in Table
2 are exact;
the recall figures must be interpreted
as relative recall representing the total number of retrieved relevant
items divided by the total number of known relevant for
each query. Overall, the average precision improves by 9
restricted
matching
items)
2. All these items are treated
percent over the unrestricted
global text match of run
1 when the simple local sentence match is used, and
requirement.
of the global/local
retrieved
uation
and [20402]
the
Query
(run 2 of Table 2(a)), but not for the extra 5,000 documents retrieved by the unrestricted
run 1 that were
rejected by the local sentence-matching
process of run
more than
Turner
all
“G”
for all articles
90 percent of the total sentence similarity.
The rightmost column of Table 1 shows that the two remaining
items
90
data were obtained
out
in most of the retrieved articles, the
similarity
threshold
of 75.0 may be
without
+18.5%
+9.7%
text.
To avoid that possibility,
a restricted
sentence
match requirement
can be used specifying that at least
two matching terms must be present in a matching sen-
nonrelevant
St. Gallen)
. . . . . . . . .
0.5043
0.7631
.
the simple sentence match is based on a straightforward computation
of pair-wise sentence similarity
between all sentence pairs in the respective query and
document texts. When a term is highly weighted, as is
with
1345
. .
Precision
Table 2: Evaluation
tence pair,
Text
Sections
or
Paragraphs
1204
0.6667
15 dots
(full
in common
Text
. . . . . . . . . . . . . . . . .
0.4834
0.4583
5 dots
b)
gle term
4)
Sections
Precision
even though
run
5. Full
With
1106
after
Ii-point
reached
on
“G”)
Documents
....... . .
matching
(letter
4. output
Sentence
Match
. . . . . . . . . . . . . . . . . . . . . . .
Recall
.
Articles
assessments
queries
. . . .
.
0.7097
. . . .. . . . . . .
0.7779
+11.3~o
for
Parameters
0.8631
.
0.6887
3. Restricted
Evaluation
0.6164
. .. ... . .. . .
0.4167
.
0.6988
a)
..
0.6629
.
13,669
.
Recall
the case for “gall”
required sentence
Paragraphs
......... .
.. . .. . .. .
Recall
sentence
5. Pull Text
Sections
or
Text
Documents
.... ......................
The
4. output
With
3. Restricted
Sentence
Sentence
sys-
by an additional
2 percent
matching system.
tem is presented in the first 3 columns of figures of Table
2(a). The results of Table 2(a) are average recall and
precision figures obtained by using as queries all encyclopedia articles starting with the letter G, and comparing them with all other encyclopedia
articles. A global
similarity
threshold of 0.20 is used, and all query articles
are included in the evaluation
for which the relevance
Retrieval
assessors found at least one relevant document.
This
represents a total of 982 “G” queries.
Full relevance
documents
restricting
of Text
for the restricted
Excerpts
The global/local
text matching
the previous section is capable
-JL
sentence
strategy described in
of retrieving
relevant
with a high degree of accuracy.
However,
the retrieval to full document texts presents
Query:
[9562]
Gall
Ret rieved
Document
Query
Number
Similarity
12134.c1O
0.32
Insect/Reproduction
11847.p6
0.32
Hymenopteran
3.
17675.P6
0.28
Parasite
4.
12134
0.27
Insect
●
0.25
Parasite/Parasitic
0.24
Parasite
●
7.
11847
0.24
Hymenopteran
b
8.
9.
17061
0.23
●
●
22077
0.23
Oak
Tannins
●
●
10.
11.
18414
22034
0.23
0.23
Plant
●
●
Tamtarisk
●
●
12.
11913.p3
0.22
Ichneumon
13.
22075
0.21
Tannic
14.
11913
0.20
Ichneumon
15.
4953
23827.c4
0.20
Chalcid
0.20
Wasp/Characteristics
Most
of the long, discursive
Plants
Fly
●
Acid
Fly
obviously,
documents
the users will
that
can be reduced
enhanced
text passages instead
●
●
●
●
●
●
●
ments, with
be
retrieval
●
.
✎
by making
covering
paragraph
items.
cover a number
which
tion
to retrieve
sections,
43 text
paragraphs
article
describing
0.32, instead
of 0.27 for the full
the material
on insects
1. The section
only, whenever
unretrieved
is replaced
insect
by sec-
reproduction,
retrieval
relevant
from
document.
retrieval
rank
This
lifts
4 to rank
system also raises a previously
item above the retrieval
threshold
of 0.20. Table 3 shows that the article on Wasps [23827]
was originally
rejected because of the low query simi-
system be used which
text
contains
10 of that
and particularly
the reproduction
of gall wasps. Furthermore, the query similarity
for vector [12 134.c1O] is
and the retrieval
it possible
for retrieval
column
The move to section retrieval promotes a number of
retrieved items. Thus, the full article on Insects [12134]
of full documents
text decomposition
the right-most
as well as sections and full text
query similarity
the query similarity
of a text excerpt is larger th~an the
similarity
of the complete document.
This suggests that
considers
●
Effect of Retrieval of Text Excerpts (Sections and Paragraphs)
for Query [9562] Gall (ci := section i; pj paragraph j)
topics will be low, implying
that many long
will be rejected, even when they contain rel-
The user overload
●
..............
evant passages.
successively
●
●
17675
occurs because the global
a hierarchical
output
17675.c5
problems.
effectiveness
Sections
●
overloaded rapidly when large document texts are involved,
Second, a potential
loss in retrieval effective-
of different
documents
Mat ch
5.
Table 3:
ness (recall)
Paragraph
●
.................................... .....
23827
0.19
17.
Wasp
two main
With
6.
16.
and
output
Sentence
Tit le
1.
9
-.
Section
Restricted
larity
text
of 0.19. The query similarity
for Section 4 of that
paragraphs, and sets of adjacent text sentences. In each
case, the text excerpt with the highest query similarity
may be presented to the user first, while providing op-
article ([23827.c4] ) reaches 0.20, leading to the retrieval
of that relevant item.
Additional
relevant items are promoted
when para-
tions for obtaining
graphs are added to the retrieval process as shown in
the right-most
column of Table 3. Thus, [11913] Ich-
of the local
larger or smaller text pieces. Because
context
checking
requirement
used for re-
neumon
trieval purposes, text excerpts such as paragraphs and
sentences are already individually
accessible, and the
additional
resources
needed for passage retrievid
poses are relatively
modest.
The basic operations of a section
and paragraph
Fly is lifted
Hymenopteran
pur-
tion
from
rank
5 on Parasitic
Plants
re-
The
further
12, [11847]
([17675 .c5]) is replaced
paragraph 6 of that document
from rank 5 to rank 3.
trieval capability
are illustrated
in Table 3 for the query
[9562] Gall, previously
used in the illustration
of Table
1. The left-most
column of bullets in Table 3 corresponds to the restricted
sentence match retrieving
full
documents only (equivalent
to the right-most
column
of bullets in Table 1). In the middle column of Table
3, text sections are retrievable
in addition to full docu-
14 to rank
is raised from rank 7 to rank 2, and secby
( [17675 .p6]) and raised
effects of the passage retrieval
capability
are
detailed in the example of Table 4, covering
a retrieval
run for query [9561] on Galileo.
Here all
items retrieved
by the restricted
sentence match are
relevant to the query, but the section and paragraph
retrieval strategies are able to locate large numbers of
previously unretrieved
relevant items. A comparison of
!53
Query:
[9561]
Galileo
Restricted
Number
Document
Similarity
1.
18179.P72
0.58
Philosophy
1
●
2.
12083.P6
0.49
Inertia
1
●
3.
2501.P5
0.45
Bellarmine,
1
●
4.
1640.P22
0.45
Astronomy
1
0.44
Physics/History
2
●
1640. c8
0.43
Astronomy
3
e
7.
6165.P16
0.39
Copernicus
8.
18179.c25
0.36
Philosophy
4
9.
20683.P16
0.35
Science
1
10.
11.
6165.P17
6165.c7
6284.P5
0.32
0.31
Copernicus,
N.
1
Copernicus,
N.
0.31
cosmology
Piss
4
1
18234.c8
6.
12.
Paragraphs
Title
0.30
18353.P6
13.
14.
15.
6284.c4
0.30
16.
20683.P13
0.28
Science
Copernicus,
Science
N.
Copernican
System
Bellarmine,
Physics
St.
17.
6165
0.28
18.
20683
19.
6164
0.27
0.27
20.
2501
text
0
●
●
1
✎
1
2
e
1
e
●
13
e
2.5
●
o
4
●
●
6
1
e
●
6419.c5
0.26
Creation
3
e
23.
18234.c9
0.26
2
e
24.
25.
20686.P6
19463.P17
0.24
0.24
Physics
Scientific
26.
27.
22223.P5
1640
0.23
0.23
28.
20686
Renaissance
Telescope
1
1
0.22
29.
1396.c7
0.21
Aristotle
30.
1636, D6
0.20
Astrology
1
12132.;16
0.20
Inquisition
1
Tot al Number
Total
Retrieved
Avg.
Number
of Retrieved
Paragraphs
of Paragraphs
Effect
of bullets
retrieved
items
two text
in Table
full-text
plus 8 text
items
sections,
column
(39 paragraphs),
are subdivided
●
Per
6
95
15.8
of Text Excerpts
z ; Pj = paragraph
12
65
5.4
for Query
[9561] Galileo
of Table 4. When full documents
the average number
by
of paragraphs
(see the middle
trieved
item
is nearly
and by one full-
drops
to only
a little
and 17 text
paragraph
items,
paragraphs
item
in
in the section
on average
and section
2;
25
1.3
j)
bottom
the
●
e
Item
are replaced
●
e
Items
4 shows that
sections
covering
e
e
39
8
2
Method
of Retrieval
section retrieval),
*
.
1
Astronomy
Scientific
retrieval in addition to full texts.
All the long, originally
retrieved
Astronomy
paragraphs),
e
18234.P19
Method
●
*
21.
22,
representing
the right-most
☛
0.27
0.26
columns
item,
Robert
output
Sections
1
N.
Creation
Cosmology
(ci = section
columns
Robert
0.30
Table 4:
four full text
St.
6419.P1O
31.
six originally
Section
and
Paragraph
Query
5.
the three
output
With
Sentence
Match
of
Retrieved
for
the
are retrieved,
contained
16 for the Galileo
over 5 paragraphs
mode,
and to only
combined
section
in each requery.
This
per retrieved
1.3 paragraphs
plus
paragraph
mode. For the 982 query articles that start with the
letter G, the average reduction in retrieved text length
such as [1640]
is 43 percent for the section retrieval,
and [20683] Science (25
in the passage retrieval
paragraph
retrieval.
and 54 percent
The users necessarily
benefit
for
when
shorter text excerpts are obtained that are immediately
related to the query topic instead of only full-text
items.
The effectiveness of the section and paragraph
re-
mode, and the retrieved
sections and paragraphs
are
much more closely related to the query topic [9561]
Galileo than the original full articles.
Furthermore,
a
large number of long, originally
rejected articles are retrieved in the passage retrieval
mode, including
very
long items such as the documents
on Physics [18234]
trieval is reflected
the two right-most
by the evaluation data appearing in
columns of Table 2(a), covering sec-
tion, and section plus paragraph retrieval, respectively,
The recall and precision figures show that the passage
retrieval techniques furnish substantially
better output
than the standard restricted sentence match for full doc-
containing
146 paragraphs
of text, and on Philosophy
[18179] containing
117 paragraphs.
Many shorter relevant items are also obtained for the first time in the
passage retrieval mode, including
[6284] Cosmology (30
paragraphs),
[12132] Inquisition
(21 paragraphs),
and
[19463] Renaissance (27 paragraphs).
The efficiency improvements
provided by the section
and paragraph
retrieval
modes are illustrated
at the
uments alone. The average precision data of Table 2(a)
show that the section output (run 4) is about 9 percent
better than the restricted sentence match of run 3. This
indicates that the vast majority
of the 1,000 or so additional
54
documents
obtainable
through
the section output
are indeed relevant,
Tables 3 and 4.
Unfortunately,
as shown earlier
are understated
in Table
retrieval
2(a).
obtained
system
Full relevance
effectiveness
output.
shown
(run 5)
The improvement
in the right-most
in
data were available
the apparent
for these items.
deterioration
and paragraph
in performance
outputs
(runs
of the
for which
the most
sections
and text
queries
complete
evaluation
previously
full relevance
included
stantially
be-
outlined
closely
matching
in answer
discussion,
to availrelatively
the “effect of Galileo’s
thought of the times”,
and methods
variable-length
must
retrieval
depending
be introduced
output
on particular
at various
user require-
ments.
data, for 90
data were available.
paragraphs
In the earlier
narrowed,
for providing
4 and 5 of
in Table
strategies
or “the effect of gall wasps on the health of oak trees”.
In these circumstances
the search context must be sub-
This
levels of detail
2(b) contains
“G”
retrieval
are used to retrieve
specific contexts, for example,
discoveries on the philosophical
Table 2(a).
Table
Re-
general query statements were used to specify the retrieval context,
covering, for example,
“the life and
works of Galileo”, or “characteristics
and causes of galls
in plants”.
In many practical environments,
it is desirable to extend the retrieval capability
to much more
(column
trieval of [12083.P6] Inertia and [18353.P6] Piss in Table
4, are all treated as nonrelevant
in the evaluation since
tween section
Passage
earlier
able query articles,
are promoted
by the section and paragraph
retrieval
system. The newly retrieved, originally
rejectedl, items
obtained in the paragraph mode, exemplified by the re-
explains
Automatic
The section and paragraph
text
but not
of Table 2(a) is then due entirely to changes in the retrieval ranks of originally
retrieved relevant iterns that
no relevance
Flexible
by
irlforma-
only for the section retrieval,
for section plus paragraph
retrieval
improvements
and paragraph
tion was available
Toward
of
trieval
the actual
the full section
in the examples
One possibility
2(a)
consists in using search strategies
pable of isolating
Table 2(b)
variable-length
text
fragments
cafrom
shows that the actual improvement
to be expected with
the section output
is about 10 percent over the re-
a variety of related documents,
to be assembled into
retrieval units covering the desired subject areas in de-
stricted
tail. Consider, as an example, a request dealing with the
philosophical
influence of Galileo.
It seems reasonable
sentence
improvement
tional
match,
with
an additional
for the paragraph
8 percent
improvement
output.
If that
were applied
put of Table 2(a) for the complete
8 lpercent
addi-
to the out-
query set, the average
set of 982 queries.
These extrapolated
fig-
The
performance
and text
of the global/local
systems
of the extrapotext
that
[9561]) as a query, and
cle [9561] Galileo whose global similarity
with [18179]
Philosophy
exceeds 0.10. Table 5(b) similarly
contains
three-sentence
passage retrieval
(document
context,
at a time, it is possible to isolate n-sentence fragments
that closely match a given query context.
Table 5(a)
shows some typical three-sentence fragments from arti-
that sections and paragraphs retrieved by the passage
retrieval system are indeed relevant to the query topic,
the accuracy
is, us-
[18179]) in the Galileo
then of the Galileo article in the context of Philosophy.
By using a “moving window” containing
n sentences
ures are shown in parentheses to the right of column 5
in Table 2(a). The Galileo example of Table 4 shows
confirming
to some extent
lated performance
figures.
article
(document
ing the Galileo article
precision for run 5 would increase to 0.9280, giving an
advantage of nearly 25 percent over the base run 1 for
the complete
to carry out a dual search first of the Philosophy
tained
mi~tching
To generate
can be assessed by
fragments
from
in response to query
answers
[18179] Philosophy
ob-
[9561] Galileo.
in response
first
various sources (articles [9561] Galileo and [18179] Philosophy in the present case). Sample similarity
mea-
retrieves
a large number
of items,
many
of which
are deleted by the local sentence-matching
process. The
local text match thus acts as a precision device that
rejects many previously
ducing the total
retrieved
The passage retrieval
retrieved
nonrelevant
related
of Galileo’s
queries
noting the variations in the figures given in the first and
last rows of Table 2. The unrestricted
global text match
be useful to combine
impact
to specific
about the philosophical
fragments
work, it may
obtained
from
surements between the three-sentence fragments of Table 5 are shown in Table 6(a). An examination
of the
fragment texts reveals close relationships
in many cases.
items re-
from 17,328 to 11,296 overall.
For example,
system then acts as a recall device
fragments
the sun. These fragments would therefore be combined
in answer to an appropriate
query. Similarly fragments
prove retrieval
the unrestricted
s49-51 from [9561] Galileo and s251-253 from [18179]
Philosophy
specifically
deal with the role of Galileo in
by about
match.
25 percent
over
theory
both
deal with
[9561] and
of retrieved items from about 11,300 to about 13,200.
The combined effect of these procedures appears to imeffectiveness
global text
[18179]
article
s234-236
Copernican
from
516-18 from
by adding text excerpts from relevant items that were
not obtained earlier, and increasing the total number
for the movement
aspects
of planets
of the
around
shaping the philosophical
thought of his time by introducing experimental
approaches in physics.
The text
passage constructed
from the text fragments
used in
55
the last example
is reproduced
in Table
6(b).
12. U. Hahn
and
und Architektur
TOPIC, Report
Such a
passage may be generated automatically
in answer to
queries about the philosophical
contributions
of Galileo.
Detailed
experiments
of this type remain
with
text
to be carried
stanz, Germany,
passage construction
out.
13. U. Reimer
References
University
The Automatic
Creation of Literature
Abstracts,
IBM Journal of Resea?’ch and Deuelop ment 2:2, April 1958, 159-165.
1. H.P. Luhn,
of the ACM,
4:5, May 1961,
Problems
in Automatic
Ab3. H.P. Edmundson,
stracting,
Communications
of the ACM, 7:4, April
1964, 259-263.
Abstracting
and Indexing
tive Abstracts
5. L.L.
Experiments
Criteria,
Infer-
in Automatic
turing
trieval
Research,
van Rljsbergen
terworths,
Phrases,
R.N.
Oddy,
and P.W.
London,
opment
editors,
1975,
Answer
Passage
Retrieval
of the American
Science, 32:4, July
Society
155-
by Text
for
articles,
But-
Data Retrieval
by Text Searching,
10. J. O’Connor,
Journal
of Chemical Information
and Computer
Sciences, 17, 1977, 181-186.
Journal
Machinery,
the article
164.
Searching,
~ormation
Global
Text
Automatic
– Experiments
Searching,
Conference
C!.J.
Retrieval
of Answer Sentences and
9. J. O’Connor,
Answer Figures by Text Searching,
Information
11. J. O’Connor,
Publishing
Com-
Matching
Science 253:5023,30
and C. Buckley,
for cross-reference
1981, 172-191.
11:5/7,
Retrieval,
in Information
cyclopedia
Re-
S.E. Robertson,
and Management,
Wesley
Au-
Text
In-
1980,227-239.
56
entitled
Struc-
in Automatic
Proc.
Fourteenth
Int.
on Research and DevelRetrieval,
New York,
Association
ranging
articles
“United
for
1991, 21-30.
19, Funk and Wagnalls New Encyclopedia,
Wagnalls, New York, 1979, 29 volumes,
Literature
Abstracts
by
8. C.D. Paice, Constructing
Computer:
Techniques and Prospects, Information
Processing and Management,
26:1, 1990, 171-186.
Processing
1987.
Philosophical
Investigations,
and Mott Ltd., Oxford, England,
and Retrieval
Computing
in information
Williams,
December
1989.
and C. Buckley,
Encyclopedia
ACM/SIGIR
Generation
of Literature
7. C.D. Paice, Automatic
Abstracts – An Approach Based on the Identification of Self Indicating
Wittgenstein,
18. G. Salton
Retrieval,
2:4, 1958, 354-361.
search and Development,
as
MIP-8723,
gust 1991, 1012-1015.
Index for Technical
IBM Journal of Re-
Man-Made
6. P.B. Baxendale,
Literature
– An Experiment,
MA,
for Information
Extracting
and
Addison
17. G. Salton
of
1964, 260-274.
Storage
and Indexing,
Information
6:4, October 1970, 313-334.
Condensation
Report
Text Processing – The Transand Retrieval of Information
pany, Reading,
Basil Blackwell
1953.
of IndicaJournal
Automatic
Analysis,
by Computer,
16. L.
Automatic
of Contextual
Coherence
July-August
22:4,
Earl,
– Production
by Application
ence and Syntactic
the ASIS,
and A. Zamora,
Text
Base Abstraction,
of Passau, Germany,
15. G. Salton,
formation,
226-234.
4. J.E. Rush, R. Salvador,
1985.
14. J.W. McInroy,
A Concept Vector Representation
of the Paragraphs in a Document
Applied to Automatic Extracting,
Report TR78-001,
Computer
Science Department,
University
of North Carolina,
Chapel Hill, NC, 1978.
and R.E. Wyllys, Automatic
Ab2. H.P. Edmundson
stracting and Indexing – Survey and RecommendaCommunications
April
and U. Hahn,
Knowledge
tions
U. Reimer,
Entwurfsprinzipien
des Textkondensierungssy
stems
TOPIC 14/85, University
of Kon-
in length
Funk and
25,000 en-
from one line
to 150 pages of text
States of America”.
for
Similarity
Sentence
Coefficient
Adjacent
Numbers
S16-18
At
Padua,
ical
Galileo
problems.
Copernican
s28-30
Professors
there.
Theory
Galileo
rejecting
the
planets
Galileo’s
revolves
a fixed
exist
with
on floating
earth
circle
Galileo’s
could
also disputed
to careful
around
He showed
the
sun–to
interest
in
(see Astronomy:
the
Aristotelian
and
becanse
Aristotle
and that
Four
nothing
had held
new could
and Piss over
printed
attacks
that
hydrostatics,
on this
only
and he
book
followed,
physics.
interference
stands
a) Typical
beyond
15th
and
a tendency
anticipated
the earth
16th
toward
the
work
moved
centuries
of the
around
he also conceived
Polish
of the universe
Giordano
Bruno, who similarly
implications
of the Copernican
The work
Galileo
scientific
of Galileo
brought
laws,
the principles
This
as infinite
identified
theory.
to the motions
b) Typical
Three-Sentence
Obtained
Table 5:
Typical
(s=
and identical
Three-Sentence
in nature
with
mathematics
science
that
philosopher
the philosophical
of a new world
view.
to the formulation
of mechanics,
which
of
applied
of hodies.
Fragments
in Response
Fragments
sentence)
57
from
[18179]
to [9561]
Galileo
Extracted
0.1870
of the universe;
The Italian
developed
by
of Cuss
in his suggestion
in the development
the
Nicholas
the center
God.
God,
was accompanied
prelate
from
with
of applying
by creating
of geometry
Galileo
Copernicus
humanity
importance
[9561]
Catholic
Nicolaus
the universe
to the importance
he accomplished
new vistas in astronby philosophical
and
Philosophy
interest
Roman
displacing
was of even grenter
attention
The
astronomer
thus
from
to [18179]
of scientific
mysticism.
the sun,
Fragments
in Response
a revival
pantheistic
0.1912
science.
Three-Sentence
Obtained
In the
0.1958
ever appear
Galileo’s
most valuable
scientific
ccmtribution
was his founding
of physics on precise measurements
rather than on metaphysical
principles
and formal logic. More widely influential,
theological
s251-253
the
of pendu-
little
theory
however,
were The Starry
Messenger
and the Dialogue,
which opened
omy. Galileo’s
lifelong struggle
to free scientific
inquiry
from restriction
s234-236
discovered
0.1012
earth.
at Florence
in 1612,
of mathemat-
the motions
the Copernican
discoveries
professors
solution
studied
of materials.
in the heavens
bodies
With
measurements,
of projectiles,
and the strength
scorned
bodies
a book
physics
path
Sentences)
for the practical
in 1!595 he preferred
)-that
that
of philosophy
spherical
published
s49-51
beginning
assumption
perfectly
speculative
mechanics
although
Ptolemaic
from
“compass”
and of the parabolic
and investigated
astronomy,
The
bodies
Fragments(3
a calculating
He turned
law of falling
lums,
invented
Sentence
Philosophy
from Related
Documents
0.5800
Query
Three-Sentence
Three-Sentence
Fragment
Fragment
[9561]
[18179]
Galileo
Fragment
Philosophy
Similarity
18179 .s251-253
0.4484
9561 .s49-51
18179 .s251-253
0.4040
9561. s16-18
18179 .s234-236
0.2002
9561. s16-18
18179 .s251-253
0.1816
Typical
The
work
a new
Fragment
Similarity
Measurements
Fragments
of Table 5
of Galileo
world
view.
was
of even
Galileo
the motions
greater
brought
mathematics
to the formulation
creating
the science of mechanics,
9561. s49–51
Pairwise
From
9561 .s28-30
a)
18179 .s251–253
From
for
importance
attention
to the
in the
development
importance
of
of applying
of scientific
laws.
This he accomplished
which applied the principles
of geometry
by
to
of bodies.
Galileo’s
most valuable
scientific
contribution
was his founding
of physics on
precise measurements
rather than on metaphysical
principles
and formal logic.
More
widely
logue,
which
scientific
stands
b) Extract
influential,
opened
inquiry
beyond
Covering
however,
new vistas
from
restriction
were
The
in astronomy.
Starry
Galileo’s
by philosophical
Messenger
and
lifelong
struggle
and theological
science.
“Galileo’s
Role in Shaping
Table 6: Text Passage Formed
58
Philosophical
by Fragment
Thought”
Juxtaposition
the
Dia-
to free
interference
Download