From: AAAI Technical Report FS-94-02. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. Relevance in Textual Retrieval Carol Barry School of Library and Information Science Louisiana State University Baton Rouge, LA 70803 (504) 388-1495 voice (504) 388-1468 voice (504) 388-1465 fax (504) 388-4581 fax kraft@bit.csc.lsu.edu isbary@isuvm.sncc.lsu.edu Donald H. Kraft Department of Computer and Science Text processing has stimulated great interest over the last several years, prompted by technical advances in storage, searching, telecommunications, and user interfaces. The increasing generation of text causes problems in terms of storage and retrieval, and there are no signs of this trend abating in the future. This technology includes electronic publishing, computer networks (e.g., electronic mail and bulletin boards), full-text systems, image databases, and hypermedia systems. Moreover, the determination of which rules should fire in an expert system implies the need for relevance determination there, too. One major aspect of text processing is information retrieval, the determination of which of a set of documents or records should be retrieved in response to a user query for information. However, in spite of a variety of theoretical advances, including models, front ends, natural language processing, and artificial intelligence methods, as well as relevance feedback, improved retrieval system performance has been an elusive target. Information retrieval has been the subject of much research over the last several years, largely due to the imprecise nature of determining which textual records are relevant to user queries. One can view an information retrieval system as a set of records that are identified, acquired, indexed, and stored and a set of user queries for information that are matched to the index to determine which subset of the stored records should be retrieved and presented to the user. The index can involve descriptive information (i.e., bibliographic information in the case of textual documents, such as author, title, publisher), and content indicators (such as keywords or subject headings) to indicate the nature of what the document is "about." The use of controlled vocabularies, and possibly thesauri, versus free or natural language is an important issue in indexing in generating these terms. Indexing can be seen as a mapping from the set of documents and the set of keywords (terms or phrases) into a set of values (often either {0,i} or [0,i]) indicating how much a given document is "about" the concept(s) represented by the keyword. Often, relative frequencies are used to measure this aboutness. A later development was to add weights to the terms in the user query (generally in the same interval as the index weights) in order to indicate importance. 118 In processing a query, one can interpret the term weights as points in a vector space, probabilities of relevance, or as fuzzy set membership functions in order to be to induce a document ranking mechanism. When Boolean logic is added, it introduces difficulties in maintaining the Boolean lattice structure. Moreover, there are problems in developing a mathematical model of query processing to evaluate each record against the query that will preserve the semantics, i.e., the meaning, of the user query. For example, query weights can be interpreted as being importance weights, thresholds, or as a description of the "perfect" document. Despite these difficulties, as well as user difficulties in using it, Boolean logic is at the heart of most, if not all, commercial textual retrieval systems (e.g., Dialog, Medline, Lexus). It has been suggested that one enforce the property of separability, evaluating a document against each term in the query and then combining those evaluations according to the Boolean logic of the query. This will preserve the isomorphism between queries of one term and of Boolean expressions. Other research has incorporated artificial intelligence to develop front ends to aid in query formulation and search. These include using details about the users to better understand his/her information needs (e.g., an engineer would use the term "stress" differently than a psychologist). In addition, expert systems can be used to develop better thesauri and indexing methods. Further, relevance feedback can be incorporated to let users react to what has been retrieved and using the information about which documents the user deemed relevant to modify the query to improve the search. For example, work is underway to employ genetic algorithms to randomly search the query space for the "optimal" query. Evaluation is an issue that must be mentioned. Retrieval system performance is often measured by its ability to retrieve documents perceived to be relevant and to not retrieve documents perceived to be nonrelevant. Recall, the proportion of relevant documents retrieved, and precision, the proportion of retrieved documents that are relevant, are the two central measures used. However, the introduction of weights permits ranking of documents, so that these measures need to be extended to include partial relevance and rank. One aspect of system evaluation which has not been fully addressed is how relevance is to be defined and who is qualified to judge the relevance of information. The traditional approach to information retrieval system design and evaluation has been a "systems view" of relevance; in such a view, relevance is seen solely as a property of the internal mechanism of the system and relevance is viewed as a topical match between the subject terms used in queries and subject terms assigned to documents or records. Within the systems view, any qualified person can judge the relevance of information by simply comparing the topical aspects 119 of queries and documents or records. In recent years, there has been increasing recognition that relevance can only be judged by the user requesting the information, and that this evaluation is a cognitive, dynamic and subjective process which is influenced by factors beyond topical appropriateness of information. Recent research into user generated evaluations of relevance has identified several nontopical aspects of information which influence users’ evaluations. These include the user’s judgment of the recency of information; the extent to which the information is novel to the user; the user’s judgment of the quality or validity of the work; the extent to which the information supports or contradicts the user’s point of view; the user’s ability to understand the level of discussion presented; and so on. Research into user defined relevance has progressed to the point of identifying factors influencing users’ relevance evaluations and supporting the view of relevance as a userbased judgment which incorporates factors beyond the topical appropriateness of information. The next step, it would seem, is to begin exploration of the mechanisms and algorithms which might be incorporated into information retrieval systems to carry the retrieval process itself beyond topical matching. 120 References Barry, C. L., "User-Defined Relevance Criteria : An Exploratory Study," Journal of the American Society for Information Science, v. 45, 1994, pp.149-159 Bordogna, G., Carrara, C., and Pasi, G., "Query Term Weights as Constraints in Fuzzy Information Retrieval, " Information Processinq and Manaqement, v. i, 1991, pp. 15-26 Buckles, B. P. and Petry, F. E. (eds.), Genetic Alqorithms, Washington, DC: IEEE Computer Society Press, 1992 Froelich, T. J., "Relevance Reconsidered -- Toward an Agenda for the 21st Century: Introduction to Special Topic Issue on Relevance Research," Journal of theAmerican Society for Information Science, v45, 1994, pp. 124-134 Harter , S . P . , "Psychological Relevance and Information Science," Journal of the American Society for Information Science, v. 43, 1992, pp.602-615 Kraft, D. H., Bordogna, G. and Pasi, G., "An Extended Fuzzy Linguistic Approach to Generalize Boolean Information Retrieval," accepted by Journal of Information Sciences, 1994 Kraft, D. H. and Boyce, B. R., "Approaches to Intelligent Information Retrieval," in Delcambre, L. and Petry, F. E. (eds.), Advances in Databases and Artificial Intelligence, 1994 Kraft, D. H., "Advances in Information Retrieval:Where is That /#*%@^ Record?," in Yovits, M. (ed.), Advances in Computers, v. 24, Academic Press, New York, 1985, pp. 277-318 Kraft, D. H. and Buell, D. A., "Fuzzy Sets and Generalized Boolean Retrieval Systems," International Journal of Man-Machine Studies, v. 19, 1983, pp. 45-56 Kraft, D. H., Petry, F. E., Buckles, B. P., and Sadasivan, T., "The Use of Genetic Programming to Build Queries for Information Retrieval," IEEE Symposium on Evolutionary Computation, Orlando, FL, 1994 Miyamoto, S., Fuzzy Sets in Information Retrieval and Cluster Analysis, Boston, MA: Kluwer Academic Publishers, 1990 Park, T. K., "The Nature of Relevance in Information Retrieval: An Empirical Study," Library Quarterly, v. 63, 1993, pp. 318-351 Salton, G., Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading, MA: Addison-Wesley, 1989 Schamber, L., Eisenberg, M. B. and Nilan, M. S., "A Reexamination of Relevance: Toward a Dyanamic, Situational Definition," Information Processing & Management, v. 26, 1990, pp. 755-766 121