Approaches to Passage Retrieval in Full Text Information Systems

Approaches to Passage in Full Information Text Gerard Salton~ J. Allan, Abstract Retrieval Systems and C. Buckley mediately relevant text passages. Second, the effectiveness of the retrieval Large collections of full-text monly used in automated the stored document plete documents est. documents retrieval. When texts are long, the retrieval of com- may information are now com- not because relevant retrievable In such circumstances, efficient and effective text passages responsive automated evaluate encyclopedia the usefulness relevant material with recall and precision. to particular user needs. Most search system is used to ognizable of the proposed methods. text sentences. In operational retrieval environments, to process the full text long, book-sized of all stored documents a mix of different topics In these circumstances, of complete it is now possible documents. are stored, are naturally improvements subdividable sections, size covering in into rec- paragraphs, and of assembling text just the right amount undivided documents. Instead in- Classical that are more user needs than “ Department of Computer NY 14853-7501. This Foundation study under Science, Cornell was supported the full docu- grant IRI University, in part Methods commercial advantage, of the publication that copying is by permission To copy specific otherwisa, and Ithaca, of tha Association notice notica is for and tha Excerpting have been introduced in the past for the use abstracts, of important are then formed highly-weighted by grouping sentences into the individual text words, and the complete sentence scores are then based on the occurrence characteristics a fee permiaeion. ACM-SIG IR’93-6/93/Pittsburgh, PA, USA @l1993 ACM O-8979 j-60~-O/93/0006 /0049 user units that are then independently processed for particular purposes. Different methods have been used for the sentence Typically, weights are assigned to scoring process. ie givan requiras Text sages, or text for Computing or to rapublish, into to individual score, or weight, depending on its perceived importance in the texts under consideration. Retrievable text pasa number copyright appear, passages responsive in response to user these excerpts text sentences that are later assembled into retrievable text units. Typically, each text sentence is assigned a 89-15847. the ACM and its data text text excerpts and assembling of text passages in information retrieval and automatic abstracting. The standard approaches utilize bottomup procedures based on the identification of important by the National Permission to copy without fee all or part of this matwial grantad provided that the copias are not made or distributed titla relevant statements, needs. ciency of text utilization may be improved because the users are no longer faced with large masses of retrieved materials, but can instead concentrate on the most im- and/or As a the Several advantages are apparent when individual text passages are independently accessible. First, the effi- Machinery. corresponding such as text of varying retrievable covered in more or less detail. it is not useful to maintain to particular interest ment texts. direct may not closely re- specific user queries. This leads to the notion for identifying Many often containing text passages should be identified responsive Science subjects of information to satisfy the user population. In the remainder of this study, effective methods are introduced Introduction dividual items units, excerpts integrity of different is often higher for the text excerpts than for the corresponding full texts, leading to the retrieval of additional New approaches are described in this study for implementing selective passage retrieval systems, and identiAn more easily result, many potentially relevant items may be rejected. When text excerpts are accessible, the query similarity trieval results may be obtained by using passage retrieval strategies designed to retrieve text excerpts of varying size in response to statements of user interest. fying are generally semble narrowly-formulated, re- may also be enhanced texts than longer ones. The longer items covering a wide diversity be in the users’ best inter- activities short of highly-weighted . ..$J .5Q 49 terms in the respective sentences. In paragraphs passage retrieval applications where the task consists in retrieving text excerpts that are similar to available user queries, the number and concentration of query words included in the individual sentences is used to generate thesentence score. Increasing concentrationof query terms, measuredly the closeness of these terms in the text sentences, leads to higher assigned sentence weights. In text abstracting, placations where user queries Or text summarization text subject varying at varying levels of detail, and responding kinds of user needs. A top-down approach passages of varying ing increasingly avail- tant in representing text content. In addition to using the occurrence user requirements. be influenced 1. The of additional of the sentences factors tion being titles, Global-Local and under text sections, sentences and sentence value. strategies detected between 3. The use of syntactic relationships particular words and sentences-in the text, indicating that the corresponding text units are related and ought to be jointly retrieved. [1-8] may by sets, or vectors, be a word particular text, to favor terms stem, or phrase, and the term that occur and query texts of weighted weights with individual high frequency while being relatively collection as a whole. [15] To determine documents, sentences into meaningful text passages passages. [9-11] it is difficult by using the to produce low-level larger deep semantic analysis frames, ately filled by information texts. [12-13] deep knowledge ful for restricted techniques, readable term-weighting It is possible of particular tasks, struction of medical unrestricted subject that extracted such that stated subject as, for are appropri- the retrieved and text have not term vectors are in decreasing reflects order of the global texts. coinci- The assump- are therefore rejected. The reverse as- documents about in answer to queries about table salt might the SALT be treaties. pends on the use of these words in the language. [16] This suggests that linguistic ambiguities and multiple meanings can be eliminated by checking the local con- One suggestion that may represent a step in the right direction is based on the use of text paragraphs, rather than sentences for the construction the similarity or between two Problems caused by language ambiguity can be largely solved by making appeal to the “use theory” of meaning proposed by Wittgenstein and others, which states that the meaning of’ words and expressions de- con- summaries of certain types. When matter must be treated, as is often the case in practice, the passage retrieval summarization methods proposed heretofore proven equal to the need. threshold ity computation, based on fields will be useexample, inside rare in the sumption may or may not hold, that is, when the global similarity between two vectors is sufficiently large, the texts may nevertheless be unrelated because the common vocabulary may be used in different senses in the respective texts. Thus, based on a global vector similar- from the document approaches in a tion is that when two vectors do not exhibit a given minimal global similarity, the corresponding texts are not related. Documents whose query similarity falls below a and the use of pre- or templates, at the output dence between query and document approaches normally available for this purpose. Alternative strategies have therefore been advocated for use in text abstracting and summarization based on constructed the corresponding computed query similarity. The global vector similarity units. For example, specific rules have been proposed for generating so-called connected, concentrated, and compound text Unfortunately, A compared, and coefficients of similarity are computed based on the number and the weight of common terms included in a vector pair. This makes it possible to rank have also been suggested for assem- text included documents, stored are terms. are chosen so as particular the documents bling Analysis are based on the vector between a query and a stored document, procedures on particular in Text model, where document represented 2. The inclusion in the sentences of special clue words and clue phrases that are thought to be impor- Various to is text paragraphs, depending Processing retrieval processing term of typicality the Ret rieval The Smart given to and sec- headings. tant in the determination covering of such as: in the texts consideration, special importance texts representing figure captions, full texts, or sets of adjacent the sentence scoring system may by a number location characteristics length, specific user needs. This makes it possi- able, the sentence score is similarly based on the number and concentration of text words thought to be impor- terms, re- include ble to retrieve highly-weighted purposes, used whereby large text excerpts are chosen first, that are successively broken down into smaller pieces cover- W are not necessarily are then chosen for abstracting placing the originally available texts. [14] In the present study, the use of complete paragraphs is generalized to text in which text words and expressions occur. The linguistic context of a term such as “salt” is likely to of text passages. In be the same for many different texts dealing with the SALT treaties. The same is true for texts discussing that case, relationships are computed between individual text paragraphs, based on the number of common text components in the respective paragraphs. Certain table salt. 50 On the other hand, the contexts differ when Query: [9562] Gall (Parasitic growth in plants) Ret rieved Document Query Number 1. Unrestricted Title Similarity Simple Restricted Sentence Sentence Mat ch Mat ch output 9567 0.30 GaHe (city in Sri Lanka) 2. 1494 0.28 Art 3. 9563 0.27 Galla 4. 12134 5. 17675 0.27 0.24 6. 0.24 Turner, 7. 23008 20402 0.24 St. Gallen 8. 11847 0.24 Hymenopteran 9. 10. 17061 22077 0.23 0.23 11. 18414 0.23 Plant ● ● ● 12. 22034 0.23 Tama,risk ● ● ● ● XN ● XN ● ● ● ● ● ● ● ● ● XR ● XN ● XN ● XN Insect ● ● Parasite ● ● ● ● ● XN ● ● XN ● ● ● Oak ● ● ● Tannins ● ● ● Gallery (African people) J.M.W. 13. 170 0.22 Acorn 14. 9578 0.21 15. 16. 22075 11913 0.21 Gallium Tannic 17. 18. (painter) (city) (insects) (cross-reference) (metal) Acid Ichneumon 4953 0.20 0.20 18228 0.20 Phylloxera Chalcid Fly (parasite) (insect) ● ● ...................................................................... ........................... 19. 7304 0.19 Diseases 20. 23827 0.19 wasD Table 1: Effect of Local (removal the query texts are related Sentence Match of 7 nonrelevant to arms control uments deal with food consumption. In practice, a dual text comparison of Plants but the doc- system Restrictions for Query (N) items and 1 relevant [9562] Gall (R) item) The list of retrieved articles shown in Table 1 includes for the most part relevant items dealing with parasitic can be insects that cause the formation fortunately, a number of galls in plants. Un- used, designed to verify both the global similarity between query and document texts, as well as the coinci- cluded on the list of items in Table 1, such as “Galle”, dence in the respective a city in Sri Lanka, Turner”, local contexts: [17,18] painter. 1. The global text similarity paring the respective plained. larity 2. The Text pairs without are not further local for texts structures text with is computed sufficient environments global globad simi- are considered similarity, in the texts The retrieval next and local such as sections, are retrieved in answer number to the right entitled substructures, th{e texts The effect of the global/local text comparison system is illustrated in the example of Table 1. Here an artiin the Funk and Wagnalls a British by the term vectors, so that terms such similarity to [9562] Gall computation. “simple by the simple The next column sentence match” represents a local retrieval strategy where at least one matching pair of sentences must be found in query and retrieved are assumed to be related. cle included errors are produced are in- as ‘igallery”, “Galla”, “gallium”, and so on, are all reduced to gall, matching the term “gall” in the query. The left-most column of bullets in Table 1, entitled “unrestricted output”, identifies the top 18 items that global vector similar “J. M.W. in the query and document paragraphs and sentences are compared. When globally similar text pairs also contain a sufficient of locally and documents truncation used in constructing the term vectors: normally word stems, rather than complete words, are used ex- considered. sufficient included first by com- term vectors as previously of extraneous encyclopedia document texts above a sentence matching threshold of 75.0 in addition to a sufficiently large global similarity. [17,18] As the middle column of bullets shows, six is used as a search request, and other related encyclopedia articles are retrieved in response to the query articles. [19] Table 1 shows the 20 items exhibiting the highest similarity with the query article “Gall” (article number of the originally retrieved items do not fulfill the simple local sentence matching requirement. This includes five obviously nonrelevant items (article [170] Acorn is [9562]) based on the global vector similarity between query and retrieved article texts. Normally a minimum cross-reference not relevant vant item, threshold of 0.20 applies for retrieval purposes, and documents with smaller global similarities are rejected. because the whole article article), [18228] Phylloxera, tom of the list. 51 to another that consists of only a plus a possibly appears rele- at the bot- Evaluation 1. Global Match Parameters 2. Simple Text (un- Mat restricted) Retrieved all ch 12,491 1’7,328 queries Match Sections 11,296 12,285 after 5 dots 0.5366 0.5820 0.5915 0.6316 after 15 dots 0.7705 0.8027 0.7978 0.8702 . . .. . .. . . .. . .. . .. . .. . .. . .. . . . Precision after Precision after . n-point 5 dots 0.6051 15 dots . 0.3862 . Average . . 0.7623 +9.170 Precision Retrieval . Evaluation (partial relevance 982 Retrieved all 0.6938 0.4214 0.4466 0.4391 0.8569 +22.576 0.8340 (0.9280) +19.3% (+24.7%) Query based Recall after . 0.7141 . after 5 dots 0.7689 0.7867 08067 Precision after . 15 dots 0.5022 0.5274 0.5519 0.7520 0.8198 . Average 0.6920 strategy Retrieval relevance Evaluation for assessments of Global/Local used to carry for Text a sentence pair has only that additional matching no single term contributing local are sentence An evaluation ([23008] now match J.M.W. rejected because of text Comparison and Passage Retrieval from the owners of the encyclopedia retrieved by the simple of Table 2(a). sentence are indeed generally con- as nonrelevant The detailed shows that items rejected sin- match in the eval- example of Table by the local matching nonrelevant. 1 process This is also confirmed by the detailed recall and precision figures included in Table 2(a) which both increase sharply between runs 1 and 2. The precision figures included in Table 2 are exact; the recall figures must be interpreted as relative recall representing the total number of retrieved relevant items divided by the total number of known relevant for each query. Overall, the average precision improves by 9 restricted matching items) 2. All these items are treated percent over the unrestricted global text match of run 1 when the simple local sentence match is used, and requirement. of the global/local retrieved uation and [20402] the Query (run 2 of Table 2(a)), but not for the extra 5,000 documents retrieved by the unrestricted run 1 that were rejected by the local sentence-matching process of run more than Turner all “G” for all articles 90 percent of the total sentence similarity. The rightmost column of Table 1 shows that the two remaining items 90 data were obtained out in most of the retrieved articles, the similarity threshold of 75.0 may be without +18.5% +9.7% text. To avoid that possibility, a restricted sentence match requirement can be used specifying that at least two matching terms must be present in a matching sen- nonrelevant St. Gallen) . . . . . . . . . 0.5043 0.7631 . the simple sentence match is based on a straightforward computation of pair-wise sentence similarity between all sentence pairs in the respective query and document texts. When a term is highly weighted, as is with 1345 . . Precision Table 2: Evaluation tence pair, Text Sections or Paragraphs 1204 0.6667 15 dots (full in common Text . . . . . . . . . . . . . . . . . 0.4834 0.4583 5 dots b) gle term 4) Sections Precision even though run 5. Full With 1106 after Ii-point reached on “G”) Documents ....... . . matching (letter 4. output Sentence Match . . . . . . . . . . . . . . . . . . . . . . . Recall . Articles assessments queries . . . . . 0.7097 . . . .. . . . . . . 0.7779 +11.3~o for Parameters 0.8631 . 0.6887 3. Restricted Evaluation 0.6164 . .. ... . .. . . 0.4167 . 0.6988 a) .. 0.6629 . 13,669 . Recall the case for “gall” required sentence Paragraphs ......... . .. . .. . .. . Recall sentence 5. Pull Text Sections or Text Documents .... ...................... The 4. output With 3. Restricted Sentence Sentence sys- by an additional 2 percent matching system. tem is presented in the first 3 columns of figures of Table 2(a). The results of Table 2(a) are average recall and precision figures obtained by using as queries all encyclopedia articles starting with the letter G, and comparing them with all other encyclopedia articles. A global similarity threshold of 0.20 is used, and all query articles are included in the evaluation for which the relevance Retrieval assessors found at least one relevant document. This represents a total of 982 “G” queries. Full relevance documents restricting of Text for the restricted Excerpts The global/local text matching the previous section is capable -JL sentence strategy described in of retrieving relevant with a high degree of accuracy. However, the retrieval to full document texts presents Query: [9562] Gall Ret rieved Document Query Number Similarity 12134.c1O 0.32 Insect/Reproduction 11847.p6 0.32 Hymenopteran 3. 17675.P6 0.28 Parasite 4. 12134 0.27 Insect ● 0.25 Parasite/Parasitic 0.24 Parasite ● 7. 11847 0.24 Hymenopteran b 8. 9. 17061 0.23 ● ● 22077 0.23 Oak Tannins ● ● 10. 11. 18414 22034 0.23 0.23 Plant ● ● Tamtarisk ● ● 12. 11913.p3 0.22 Ichneumon 13. 22075 0.21 Tannic 14. 11913 0.20 Ichneumon 15. 4953 23827.c4 0.20 Chalcid 0.20 Wasp/Characteristics Most of the long, discursive Plants Fly ● Acid Fly obviously, documents the users will that can be reduced enhanced text passages instead ● ● ● ● ● ● ● ments, with be retrieval ● . ✎ by making covering paragraph items. cover a number which tion to retrieve sections, 43 text paragraphs article describing 0.32, instead of 0.27 for the full the material on insects 1. The section only, whenever unretrieved is replaced insect by sec- reproduction, retrieval relevant from document. retrieval rank This lifts 4 to rank system also raises a previously item above the retrieval threshold of 0.20. Table 3 shows that the article on Wasps [23827] was originally rejected because of the low query simi- system be used which text contains 10 of that and particularly the reproduction of gall wasps. Furthermore, the query similarity for vector [12 134.c1O] is and the retrieval it possible for retrieval column The move to section retrieval promotes a number of retrieved items. Thus, the full article on Insects [12134] of full documents text decomposition the right-most as well as sections and full text query similarity the query similarity of a text excerpt is larger th~an the similarity of the complete document. This suggests that considers ● Effect of Retrieval of Text Excerpts (Sections and Paragraphs) for Query [9562] Gall (ci := section i; pj paragraph j) topics will be low, implying that many long will be rejected, even when they contain rel- The user overload ● .............. evant passages. successively ● ● 17675 occurs because the global a hierarchical output 17675.c5 problems. effectiveness Sections ● overloaded rapidly when large document texts are involved, Second, a potential loss in retrieval effective- of different documents Mat ch 5. Table 3: ness (recall) Paragraph ● .................................... ..... 23827 0.19 17. Wasp two main With 6. 16. and output Sentence Tit le 1. 9 -. Section Restricted larity text of 0.19. The query similarity for Section 4 of that paragraphs, and sets of adjacent text sentences. In each case, the text excerpt with the highest query similarity may be presented to the user first, while providing op- article ([23827.c4] ) reaches 0.20, leading to the retrieval of that relevant item. Additional relevant items are promoted when para- tions for obtaining graphs are added to the retrieval process as shown in the right-most column of Table 3. Thus, [11913] Ich- of the local larger or smaller text pieces. Because context checking requirement used for re- neumon trieval purposes, text excerpts such as paragraphs and sentences are already individually accessible, and the additional resources needed for passage retrievid poses are relatively modest. The basic operations of a section and paragraph Fly is lifted Hymenopteran pur- tion from rank 5 on Parasitic Plants re- The further 12, [11847] ([17675 .c5]) is replaced paragraph 6 of that document from rank 5 to rank 3. trieval capability are illustrated in Table 3 for the query [9562] Gall, previously used in the illustration of Table 1. The left-most column of bullets in Table 3 corresponds to the restricted sentence match retrieving full documents only (equivalent to the right-most column of bullets in Table 1). In the middle column of Table 3, text sections are retrievable in addition to full docu- 14 to rank is raised from rank 7 to rank 2, and secby ( [17675 .p6]) and raised effects of the passage retrieval capability are detailed in the example of Table 4, covering a retrieval run for query [9561] on Galileo. Here all items retrieved by the restricted sentence match are relevant to the query, but the section and paragraph retrieval strategies are able to locate large numbers of previously unretrieved relevant items. A comparison of !53 Query: [9561] Galileo Restricted Number Document Similarity 1. 18179.P72 0.58 Philosophy 1 ● 2. 12083.P6 0.49 Inertia 1 ● 3. 2501.P5 0.45 Bellarmine, 1 ● 4. 1640.P22 0.45 Astronomy 1 0.44 Physics/History 2 ● 1640. c8 0.43 Astronomy 3 e 7. 6165.P16 0.39 Copernicus 8. 18179.c25 0.36 Philosophy 4 9. 20683.P16 0.35 Science 1 10. 11. 6165.P17 6165.c7 6284.P5 0.32 0.31 Copernicus, N. 1 Copernicus, N. 0.31 cosmology Piss 4 1 18234.c8 6. 12. Paragraphs Title 0.30 18353.P6 13. 14. 15. 6284.c4 0.30 16. 20683.P13 0.28 Science Copernicus, Science N. Copernican System Bellarmine, Physics St. 17. 6165 0.28 18. 20683 19. 6164 0.27 0.27 20. 2501 text 0 ● ● 1 ✎ 1 2 e 1 e ● 13 e 2.5 ● o 4 ● ● 6 1 e ● 6419.c5 0.26 Creation 3 e 23. 18234.c9 0.26 2 e 24. 25. 20686.P6 19463.P17 0.24 0.24 Physics Scientific 26. 27. 22223.P5 1640 0.23 0.23 28. 20686 Renaissance Telescope 1 1 0.22 29. 1396.c7 0.21 Aristotle 30. 1636, D6 0.20 Astrology 1 12132.;16 0.20 Inquisition 1 Tot al Number Total Retrieved Avg. Number of Retrieved Paragraphs of Paragraphs Effect of bullets retrieved items two text in Table full-text plus 8 text items sections, column (39 paragraphs), are subdivided ● Per 6 95 15.8 of Text Excerpts z ; Pj = paragraph 12 65 5.4 for Query [9561] Galileo of Table 4. When full documents the average number by of paragraphs (see the middle trieved item is nearly and by one full- drops to only a little and 17 text paragraph items, paragraphs item in in the section on average and section 2; 25 1.3 j) bottom the ● e Item are replaced ● e Items 4 shows that sections covering e e 39 8 2 Method of Retrieval section retrieval), * . 1 Astronomy Scientific retrieval in addition to full texts. All the long, originally retrieved Astronomy paragraphs), e 18234.P19 Method ● * 21. 22, representing the right-most ☛ 0.27 0.26 columns item, Robert output Sections 1 N. Creation Cosmology (ci = section columns Robert 0.30 Table 4: four full text St. 6419.P1O 31. six originally Section and Paragraph Query 5. the three output With Sentence Match of Retrieved for the are retrieved, contained 16 for the Galileo over 5 paragraphs mode, and to only combined section in each requery. This per retrieved 1.3 paragraphs plus paragraph mode. For the 982 query articles that start with the letter G, the average reduction in retrieved text length such as [1640] is 43 percent for the section retrieval, and [20683] Science (25 in the passage retrieval paragraph retrieval. and 54 percent The users necessarily benefit for when shorter text excerpts are obtained that are immediately related to the query topic instead of only full-text items. The effectiveness of the section and paragraph re- mode, and the retrieved sections and paragraphs are much more closely related to the query topic [9561] Galileo than the original full articles. Furthermore, a large number of long, originally rejected articles are retrieved in the passage retrieval mode, including very long items such as the documents on Physics [18234] trieval is reflected the two right-most by the evaluation data appearing in columns of Table 2(a), covering section, and section plus paragraph retrieval, respectively, The recall and precision figures show that the passage retrieval techniques furnish substantially better output than the standard restricted sentence match for full doc- containing 146 paragraphs of text, and on Philosophy [18179] containing 117 paragraphs. Many shorter relevant items are also obtained for the first time in the passage retrieval mode, including [6284] Cosmology (30 paragraphs), [12132] Inquisition (21 paragraphs), and [19463] Renaissance (27 paragraphs). The efficiency improvements provided by the section and paragraph retrieval modes are illustrated at the uments alone. The average precision data of Table 2(a) show that the section output (run 4) is about 9 percent better than the restricted sentence match of run 3. This indicates that the vast majority of the 1,000 or so additional 54 documents obtainable through the section output are indeed relevant, Tables 3 and 4. Unfortunately, as shown earlier are understated in Table retrieval 2(a). obtained system Full relevance effectiveness output. shown (run 5) The improvement in the right-most in data were available the apparent for these items. deterioration and paragraph in performance outputs (runs of the for which the most sections and text queries complete evaluation previously full relevance included stantially be- outlined closely matching in answer discussion, to availrelatively the “effect of Galileo’s thought of the times”, and methods variable-length must retrieval depending be introduced output on particular at various user requirements. data, for 90 data were available. paragraphs In the earlier narrowed, for providing 4 and 5 of in Table strategies or “the effect of gall wasps on the health of oak trees”. In these circumstances the search context must be sub- This levels of detail 2(b) contains “G” retrieval are used to retrieve specific contexts, for example, discoveries on the philosophical Table 2(a). Table Re- general query statements were used to specify the retrieval context, covering, for example, “the life and works of Galileo”, or “characteristics and causes of galls in plants”. In many practical environments, it is desirable to extend the retrieval capability to much more (column trieval of [12083.P6] Inertia and [18353.P6] Piss in Table 4, are all treated as nonrelevant in the evaluation since tween section Passage earlier able query articles, are promoted by the section and paragraph retrieval system. The newly retrieved, originally rejectedl, items obtained in the paragraph mode, exemplified by the re- explains Automatic The section and paragraph text but not of Table 2(a) is then due entirely to changes in the retrieval ranks of originally retrieved relevant iterns that no relevance Flexible by irlforma- only for the section retrieval, for section plus paragraph retrieval improvements and paragraph tion was available Toward of trieval the actual the full section in the examples One possibility 2(a) consists in using search strategies pable of isolating Table 2(b) variable-length text fragments cafrom shows that the actual improvement to be expected with the section output is about 10 percent over the re- a variety of related documents, to be assembled into retrieval units covering the desired subject areas in de- stricted tail. Consider, as an example, a request dealing with the philosophical influence of Galileo. It seems reasonable sentence improvement tional match, with an additional for the paragraph 8 percent improvement output. If that were applied put of Table 2(a) for the complete 8 lpercent addi- to the out- query set, the average set of 982 queries. These extrapolated fig- The performance and text of the global/local systems of the extrapotext that [9561]) as a query, and cle [9561] Galileo whose global similarity with [18179] Philosophy exceeds 0.10. Table 5(b) similarly contains three-sentence passage retrieval (document context, at a time, it is possible to isolate n-sentence fragments that closely match a given query context. Table 5(a) shows some typical three-sentence fragments from arti- that sections and paragraphs retrieved by the passage retrieval system are indeed relevant to the query topic, the accuracy is, us- [18179]) in the Galileo then of the Galileo article in the context of Philosophy. By using a “moving window” containing n sentences ures are shown in parentheses to the right of column 5 in Table 2(a). The Galileo example of Table 4 shows confirming to some extent lated performance figures. article (document ing the Galileo article precision for run 5 would increase to 0.9280, giving an advantage of nearly 25 percent over the base run 1 for the complete to carry out a dual search first of the Philosophy tained mi~tching To generate can be assessed by fragments from in response to query answers [18179] Philosophy ob- [9561] Galileo. in response first various sources (articles [9561] Galileo and [18179] Philosophy in the present case). Sample similarity mea- retrieves a large number of items, many of which are deleted by the local sentence-matching process. The local text match thus acts as a precision device that rejects many previously ducing the total retrieved The passage retrieval retrieved nonrelevant related of Galileo’s queries noting the variations in the figures given in the first and last rows of Table 2. The unrestricted global text match be useful to combine impact to specific about the philosophical fragments work, it may obtained from surements between the three-sentence fragments of Table 5 are shown in Table 6(a). An examination of the fragment texts reveals close relationships in many cases. items re- from 17,328 to 11,296 overall. For example, system then acts as a recall device fragments the sun. These fragments would therefore be combined in answer to an appropriate query. Similarly fragments prove retrieval the unrestricted s49-51 from [9561] Galileo and s251-253 from [18179] Philosophy specifically deal with the role of Galileo in by about match. 25 percent over theory both deal with [9561] and of retrieved items from about 11,300 to about 13,200. The combined effect of these procedures appears to imeffectiveness global text [18179] article s234-236 Copernican from 516-18 from by adding text excerpts from relevant items that were not obtained earlier, and increasing the total number for the movement aspects of planets of the around shaping the philosophical thought of his time by introducing experimental approaches in physics. The text passage constructed from the text fragments used in 55 the last example is reproduced in Table 6(b). 12. U. Hahn and und Architektur TOPIC, Report Such a passage may be generated automatically in answer to queries about the philosophical contributions of Galileo. Detailed experiments of this type remain with text to be carried stanz, Germany, passage construction out. 13. U. Reimer References University The Automatic Creation of Literature Abstracts, IBM Journal of Resea?’ch and Deuelop ment 2:2, April 1958, 159-165. 1. H.P. Luhn, of the ACM, 4:5, May 1961, Problems in Automatic Ab3. H.P. Edmundson, stracting, Communications of the ACM, 7:4, April 1964, 259-263. Abstracting and Indexing tive Abstracts 5. L.L. Experiments Criteria, Infer- in Automatic turing trieval Research, van Rljsbergen terworths, Phrases, R.N. Oddy, and P.W. London, opment editors, 1975, Answer Passage Retrieval of the American Science, 32:4, July Society 155- by Text for articles, But- Data Retrieval by Text Searching, 10. J. O’Connor, Journal of Chemical Information and Computer Sciences, 17, 1977, 181-186. Journal Machinery, the article 164. Searching, ~ormation Global Text Automatic – Experiments Searching, Conference C!.J. Retrieval of Answer Sentences and 9. J. O’Connor, Answer Figures by Text Searching, Information 11. J. O’Connor, Publishing Com- Matching Science 253:5023,30 and C. Buckley, for cross-reference 1981, 172-191. 11:5/7, Retrieval, in Information cyclopedia Re- S.E. Robertson, and Management, Wesley Au- Text In- 1980,227-239. 56 entitled Struc- in Automatic Proc. Fourteenth Int. on Research and DevelRetrieval, New York, Association ranging articles “United for 1991, 21-30. 19, Funk and Wagnalls New Encyclopedia, Wagnalls, New York, 1979, 29 volumes, Literature Abstracts by 8. C.D. Paice, Constructing Computer: Techniques and Prospects, Information Processing and Management, 26:1, 1990, 171-186. Processing 1987. Philosophical Investigations, and Mott Ltd., Oxford, England, and Retrieval Computing in information Williams, December 1989. and C. Buckley, Encyclopedia ACM/SIGIR Generation of Literature 7. C.D. Paice, Automatic Abstracts – An Approach Based on the Identification of Self Indicating Wittgenstein, 18. G. Salton Retrieval, 2:4, 1958, 354-361. search and Development, as MIP-8723, gust 1991, 1012-1015. Index for Technical IBM Journal of Re- Man-Made 6. P.B. Baxendale, Literature – An Experiment, MA, for Information Extracting and Addison 17. G. Salton of 1964, 260-274. Storage and Indexing, Information 6:4, October 1970, 313-334. Condensation Report Text Processing – The Transand Retrieval of Information pany, Reading, Basil Blackwell 1953. of IndicaJournal Automatic Analysis, by Computer, 16. L. Automatic of Contextual Coherence July-August 22:4, Earl, – Production by Application ence and Syntactic the ASIS, and A. Zamora, Text Base Abstraction, of Passau, Germany, 15. G. Salton, formation, 226-234. 4. J.E. Rush, R. Salvador, 1985. 14. J.W. McInroy, A Concept Vector Representation of the Paragraphs in a Document Applied to Automatic Extracting, Report TR78-001, Computer Science Department, University of North Carolina, Chapel Hill, NC, 1978. and R.E. Wyllys, Automatic Ab2. H.P. Edmundson stracting and Indexing – Survey and RecommendaCommunications April and U. Hahn, Knowledge tions U. Reimer, Entwurfsprinzipien des Textkondensierungssy stems TOPIC 14/85, University of Kon- in length Funk and 25,000 en- from one line to 150 pages of text States of America”. for Similarity Sentence Coefficient Adjacent Numbers S16-18 At Padua, ical Galileo problems. Copernican s28-30 Professors there. Theory Galileo rejecting the planets Galileo’s revolves a fixed exist with on floating earth circle Galileo’s could also disputed to careful around He showed the sun–to interest in (see Astronomy: the Aristotelian and becanse Aristotle and that Four nothing had held new could and Piss over printed attacks that hydrostatics, on this only and he book followed, physics. interference stands a) Typical beyond 15th and a tendency anticipated the earth 16th toward the work moved centuries of the around he also conceived Polish of the universe Giordano Bruno, who similarly implications of the Copernican The work Galileo scientific of Galileo brought laws, the principles This as infinite identified theory. to the motions b) Typical Three-Sentence Obtained Table 5: Typical (s= and identical Three-Sentence in nature with mathematics science that philosopher the philosophical of a new world view. to the formulation of mechanics, which of applied of hodies. Fragments in Response Fragments sentence) 57 from [18179] to [9561] Galileo Extracted 0.1870 of the universe; The Italian developed by of Cuss in his suggestion in the development the Nicholas the center God. God, was accompanied prelate from with of applying by creating of geometry Galileo Copernicus humanity importance [9561] Catholic Nicolaus the universe to the importance he accomplished new vistas in astronby philosophical and Philosophy interest Roman displacing was of even grenter attention The astronomer thus from to [18179] of scientific mysticism. the sun, Fragments in Response a revival pantheistic 0.1912 science. Three-Sentence Obtained In the 0.1958 ever appear Galileo’s most valuable scientific ccmtribution was his founding of physics on precise measurements rather than on metaphysical principles and formal logic. More widely influential, theological s251-253 the of pendu- little theory however, were The Starry Messenger and the Dialogue, which opened omy. Galileo’s lifelong struggle to free scientific inquiry from restriction s234-236 discovered 0.1012 earth. at Florence in 1612, of mathemat- the motions the Copernican discoveries professors solution studied of materials. in the heavens bodies With measurements, of projectiles, and the strength scorned bodies a book physics path Sentences) for the practical in 1!595 he preferred )-that that of philosophy spherical published s49-51 beginning assumption perfectly speculative mechanics although Ptolemaic from “compass” and of the parabolic and investigated astronomy, The bodies Fragments(3 a calculating He turned law of falling lums, invented Sentence Philosophy from Related Documents 0.5800 Query Three-Sentence Three-Sentence Fragment Fragment [9561] [18179] Galileo Fragment Philosophy Similarity 18179 .s251-253 0.4484 9561 .s49-51 18179 .s251-253 0.4040 9561. s16-18 18179 .s234-236 0.2002 9561. s16-18 18179 .s251-253 0.1816 Typical The work a new Fragment Similarity Measurements Fragments of Table 5 of Galileo world view. was of even Galileo the motions greater brought mathematics to the formulation creating the science of mechanics, 9561. s49–51 Pairwise From 9561 .s28-30 a) 18179 .s251–253 From for importance attention to the in the development importance of of applying of scientific laws. This he accomplished which applied the principles of geometry by to of bodies. Galileo’s most valuable scientific contribution was his founding of physics on precise measurements rather than on metaphysical principles and formal logic. More widely logue, which scientific stands b) Extract influential, opened inquiry beyond Covering however, new vistas from restriction were The in astronomy. Starry Galileo’s by philosophical Messenger and lifelong struggle and theological science. “Galileo’s Role in Shaping Table 6: Text Passage Formed 58 Philosophical by Fragment Thought” Juxtaposition the Dia- to free interference

Approaches to Passage Retrieval in Full Text Information Systems

Related documents

Products

Support

Approaches to Passage Retrieval in Full Text Information Systems

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib