From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved. Project presentation REPTILE Retrieval, Extraction, Presentation and Translation of Information using Language Engineering Kristofer Franz6n 1 and Jussi Karlgren 2 1 Departmentof Linguistics, StockholmUniversity S-106 91 Stockholm, Sweden franzen @ling.su.se 2 SICS, SwedishInstitute of ComputerScience Box 1263, S-164 28 Kista, Sweden Jussi.Karlgren @sics.se independent representations about the content of documentsbased on wordoccurrence statistics of different kinds: performing a sort of shallow semantic analysis, in effect. The primary research goal has been to create an optimal representation for the general case. Throughout the history of information retrieval, however, the research community has been aware of the fact that the interaction of information seeking users and the tools to access information sources is important in itself. Information can be sought for various reasons and with various ideas of how to determine what documents are relevant. (Luhn, 1957). This project builds on the premise that it is necessaryto find more information from texts and their users than a shallow approximation of text topic. Texts have, besides content, style, ecology, and internal structure, all of which can be automatically identified from a text base and its usage statistics and used for text categorization; and users have information seeking strategies that can be recognized through user studies and supported through interface design. Project Status This presentation will give the outlines of a project in its starting stages. Our research group, with participants from the universities in LinkSping and Stockholm, Sweden, is defining a large project aimed at constructing an interactive tool for multilingual retrieval, analysis, summarization and translation of documentsin large text databases. Our purpose with this early presentation is to get criticism and suggestions that will save us from committingat least someerrors. Base in Existing Technology Weintend to apply already existing approaches and use existing systems as far as possible. Weexpect to need to modify and develop someof the algorithms and techniques further to handle the multilingual interactive setting we plan our prototype to work in. The most important motivation for further development is that we work primarily with a language that is not English. Present methods are mostly based on English systems with minor adaptations. These may work well, but do not makeuse of language specific features that might be exploited for retrieval purposes. Webelieve that there are some interesting possibilities to be found here, e.g., in the considerably more complex morphology of Swedish as comparedto English (K~tUgren, 1984). General Solutions Combining Knowledge Sources In the information retrieval field claims have recurrently been made that the prevailing predominantly statistical approaches have reached an asymptote and that a fresh look, and new and different approaches, are necessary to substantially improveresults. So far, statistical methods have shownno great vulnerability to these claims: added linguistic methods tend to show promise, but not enough to warrant the expenses in computation and memorythey incur. However,in the last few years, with the advent of reliable low-level linguistic analysis, the utility of linguistically based methods at last seems to have are not Sufficient Research in information retrieval and documentanalysis has traditionally concentrated on building general, task- 37 improved. In addition, method testing has been made on English, rather than on morphologically richer languages which might afford somebetter purchase to the rather low level of linguistic information that the systems in question attempt to extract: tentative syntactic role, valency, clause transitivity, e.g. An especially crucial question is howto merge information from various knowledgesources into a system based mainly on word occurrence statistics - which still is and will remainthe single most efficient retrieval cue. This is something we have worked with to some extent in past Text Retrieval Conferences(Strzalkowski et al, 1997). Non-topical Retrieval Moreand more texts becomeavailable on the Internet but at the same time they become more anonymous in the sense that they all basically look the sameto the eye. This reduces somedimensions of the information we use, e.g., whenlooking for books in an ordinary library. Youcan no longer tell if somethingis a text of a kind that wouldbe printed in a refereed journal or in half binding, or if it is a text that somebodyhas just put out on the net fresh from his/her owncomputer. There is also often no visible way to tell if the text aims at beginners or specialists in its field. This uniformity problem grows even larger when we get in touch with texts in languages that we knowpoorly or not at all. To address this problem,we performstylistic analysis of texts. Recent experiments (Karlgren and Cutting, 1994; Karlgren, 1996; Karlgren and Straszheim, 1997) indicate that this is not only feasible but can be quite helpful. Someresults from these experiments will be presented and discussed at the symposium. The ecology of texts (Walker 1991) can be as important to users as the content and genre. Whoreads a document is an important factor in deciding which to choose among several candidates. Social filtering, the techniqueof using text ecology or text usage as a discriminant, is easily applied to areas as diverse as Usenet Newstexts, Usenet News conferences, music albums, video movies, and novels (Karlgren, 1990; Karlgren, 1994). Subtopic Retrieval Most IR systems treat texts as homogeneous units, disregarding factors like length and internal structuring. Oneway of compensatingfor this is to try to divide longer texts into shorter segments. This segmentation becomes really interesting whenit is not done mechanicallybut on the basis of results from somekind of content analysis, e.g., subtopic structure. Workalong these lines has been carried out on English texts (Hearst 1997), and we will report on our experiences of applying the same methodsto 38 Swedish texts. Such methods have already been tried for finding relevant stretches in longer texts and for summarization.In a multilingual search situation, it will also be necessary to first identify the languageused in the found documents by means of some algorithm (Kikui 1996). Crude Translations on Demand In connection with cross-language retrieval from a small language such as Swedish, it becomes increasingly importantto be able to translate in both directions, i.e., to first translate the search query into somemajor languages (possibly to be specified by the user) besides Swedishand then handle the result set in an intelligent way. The possibility of online, full translations of entire documents is not to be expected and arguably not even desired. The user will probably first want to have a look at some translated descriptors, e.g., index terms, central sentences or a short summarization, before deciding whether a certain text is relevant. This gives us another reason to examine the techniques for subtopic analysis in some detail, viz., to automatically pick short and relevant subtexts to makecrude translations of, in order to help the user judge the relevance of the entire document. Crude translations of this modestkind, vectored towards a specific domain, presented to a knowledgeableuser, and with a lowlevel of textual ambition, are quite feasible with the technology at hand today. At present, however, we do not even knowwhat a user wants, needs and can handle in a cross languagesearch situation. Situation-Specific Aspects In the field of information retrieval the research questions have been admirably well reduced to their cleanest possible form. This has helped establish a well-founded body of research, but at the same time left several important factors unstudied. User background, user motivation, and situation specific factors have largely been ignored in mainstream information retrieval research: the idealized picture of an information retrieval session is that the user needs can be reduced to a simple verbal query of relatively short length. Knowledgeabout people’s spontaneous search strategies tells us that the picture is a lot more complex(Walker, 1991; Belkin, 1994; HOOk et al., 1995; Hth3k, 1996). With the increased availability of electronically stored information there is an attendant influx of newusers with various backgrounds and various and perhaps unforeseen information needs. Sometimespeople do look for specific documentsor for answers to specific questions, but quite as often they do not know what their information need really is and they maynot even have enough knowledgeto formulate a first question that might lead them further. Sometimes the user wants a whole set of documents to cover the area of interest, sometimesone representative document is enough. As the amount of available information and the number of users increase, various kinds of adaptations of the search process and of the presentation of the retrieved set of documents becomes more and more urgent. All this points to the necessity of careful empirical studies of user needs and user strategies and we intend to carry out such studies in parallel with the developmentof the system (Hansen and Karlgren, 1996). At this symposiumwe will present experiments with seemingly language independent or language tunable techniquesfromthe fields of stylistic analysis and subtopic segmentation, and discuss the question of whether they are really independent or in fact makeimplicit assumptionsof the source language. Karlgren, J. 1994. "NewsgroupClustering Based On User Behavior - A RecommendationAlgebra", SICS Technical Report T94:04, Swedish Institute of Computer Science, Stockholm. Karlgren, J. 1990. "An Algebra for Recommendations", Syslab Working Paper 179, Department of Computer and System Sciences, Stockholm. Stockholm University. Karlgren, J., and Straszheim, T. 1997. "Visualizing Stylistic Variation." In the Proceedingsof the 30th Hawaii International Conference on Systems Sciences, January 710, 1997, Maui, Hawaii. Karlgren, J.; Genres with Analysis". In Computation 1g/9410008.) and Cutting, D. 1994. "Recognizing Text Simple Metrics Using Discriminant Proceedings of COLING 94, Kyoto. (In the and Language E-Print Archive: cmp- References Kikui, G. 1996. "Identifying the Coding System and Language of On-Line Documents on the Internet". In Proceedings of COLING-96, August 5-9, 1996. Vo1.II:652-657., Copenhagen, Denmark. Belkin, N. J. 1994. "Design Principles for Electronic Textual Resources: Investigating Users and Uses of Scholarly Information". In Zampolli, A.; Calzolari, N.; and Palmer, M. (eds.) Current Issues in Computational Linguistics: In Honour of Don Walker, Dordrecht: Kluwer. Hansen, P.; and Karlgren, J. 1996. "Designing Interactive IR Combining Adaptation to User Tasks and Strategies with Non-topic Document Analysis". In The DELOS Kick-Off Workshop: Pre-Workshop Abstracts, SophiaAntipolis, France, March. ERCIM. Hearst, M. A. 1997. "TextTiling: Segmenting Text into Multi-Paragraph Subtopic Passages". Computational Linguistics, March1997. H66k, K. 1996. A Glass Box Approach to Adaptive Hypermedia.Ph.D. thesis, Dept. of Computerand Systems Sciences, StockholmUniversity, H66k, K.; Karlgren, J.; and W~n,A. 1995. "A Glass Box Intelligent Help Interface", Paper presented to the First International workshopon Multi-modal Interfaces (IMMI1), Edinburgh. Karlgren, J. 1996. "Stylistic Variation in an Information Retrieval Experiment" In proceedings of The Second International Conference on NewMethods in Language Processing - NeMLaP2, Bilkent, September 1996. Ankara: Bilkent University. (In the Computation and Language B-Print Archive: cmp-lg/9608003.) 39 K/illgren, G. 1984. "Automatiskexcerpering av substantiv ur 16pande text. Ett m6jligt hj~ilpmedel vid automatisk indexering?" IRI-rapport 1984:1. Institutet f6r R/ittsinformatik, Stockholmsuniversitet. Luhn, H. P. 1957. "A Statistical Approachto Mechanized Encoding and Searching of Literary Information". IBM Journal of Research and Development 1 (4) 309-317. (Reprinted in H. P. Lnhn: Pioneer of Information Science, selected works. Schultz, C. K. (ed.) 1968. NewYork: Sparta. Strzalkowski, T.; Guthrie, L.; Karlgren, J.; Leistensnider, J.; Lin, F.; Perez-Carballo, J.; Straszheim, T.; Wang,J.; and Wilding, J. 1996. "Natural Language Information Retrieval: TREC-5Report" In Proceedings of the fifth Text Retrieval Conference, Harman, D. (ed.), NIST Special Publication, Gaithersburg: NIST. Walker, D. E. 1991. "The Ecology of Language". In Proceedings of the International WorkshopOn Electronic Dictionaries. Japan Electronic Dictionary Research Institute. Also in Zampolli, A.; Calzolari, N.; and Palmer, M. (eds.). 1994. Current Issues in Computational Linguistics: In Honour of Don Walker, Dordrecht: Kluwer.