Project presentation REPTILE

advertisement
From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Project presentation
REPTILE
Retrieval, Extraction, Presentation and Translation of Information using
Language Engineering
Kristofer
Franz6n 1 and Jussi Karlgren 2
1 Departmentof Linguistics, StockholmUniversity
S-106 91 Stockholm, Sweden
franzen @ling.su.se
2 SICS, SwedishInstitute of ComputerScience
Box 1263, S-164 28 Kista, Sweden
Jussi.Karlgren @sics.se
independent representations
about the content of
documentsbased on wordoccurrence statistics of different
kinds: performing a sort of shallow semantic analysis, in
effect. The primary research goal has been to create an
optimal representation for the general case.
Throughout the history of information retrieval,
however, the research community has been aware of the
fact that the interaction of information seeking users and
the tools to access information sources is important in
itself. Information can be sought for various reasons and
with various ideas of how to determine what documents
are relevant. (Luhn, 1957).
This project builds on the premise that it is necessaryto
find more information from texts and their users than a
shallow approximation of text topic. Texts have, besides
content, style, ecology, and internal structure, all of which
can be automatically identified from a text base and its
usage statistics and used for text categorization; and users
have information seeking strategies that can be recognized
through user studies and supported through interface
design.
Project Status
This presentation will give the outlines of a project in its
starting stages. Our research group, with participants from
the universities in LinkSping and Stockholm, Sweden, is
defining a large project aimed at constructing an
interactive tool for multilingual retrieval, analysis,
summarization and translation of documentsin large text
databases. Our purpose with this early presentation is to
get criticism and suggestions that will save us from
committingat least someerrors.
Base in Existing
Technology
Weintend to apply already existing approaches and use
existing systems as far as possible. Weexpect to need to
modify and develop someof the algorithms and techniques
further to handle the multilingual interactive setting we
plan our prototype to work in. The most important
motivation for further development is that we work
primarily with a language that is not English. Present
methods are mostly based on English systems with minor
adaptations. These may work well, but do not makeuse of
language specific features that might be exploited for
retrieval purposes. Webelieve that there are some
interesting possibilities to be found here, e.g., in the
considerably more complex morphology of Swedish as
comparedto English (K~tUgren, 1984).
General Solutions
Combining
Knowledge
Sources
In the information retrieval field claims have recurrently
been made that the prevailing predominantly statistical
approaches have reached an asymptote and that a fresh
look, and new and different approaches, are necessary to
substantially improveresults. So far, statistical methods
have shownno great vulnerability to these claims: added
linguistic methods tend to show promise, but not enough
to warrant the expenses in computation and memorythey
incur. However,in the last few years, with the advent of
reliable low-level linguistic analysis, the utility of
linguistically
based methods at last seems to have
are not Sufficient
Research in information retrieval and documentanalysis
has traditionally concentrated on building general, task-
37
improved. In addition, method testing has been made on
English, rather than on morphologically richer languages
which might afford somebetter purchase to the rather low
level of linguistic information that the systems in question
attempt to extract: tentative syntactic role, valency, clause
transitivity, e.g. An especially crucial question is howto
merge information from various knowledgesources into a
system based mainly on word occurrence statistics - which
still is and will remainthe single most efficient retrieval
cue. This is something we have worked with to some
extent in past Text Retrieval Conferences(Strzalkowski et
al, 1997).
Non-topical
Retrieval
Moreand more texts becomeavailable on the Internet but
at the same time they become more anonymous in the
sense that they all basically look the sameto the eye. This
reduces somedimensions of the information we use, e.g.,
whenlooking for books in an ordinary library. Youcan no
longer tell if somethingis a text of a kind that wouldbe
printed in a refereed journal or in half binding, or if it is a
text that somebodyhas just put out on the net fresh from
his/her owncomputer. There is also often no visible way
to tell if the text aims at beginners or specialists in its
field. This uniformity problem grows even larger when we
get in touch with texts in languages that we knowpoorly
or not at all. To address this problem,we performstylistic
analysis of texts. Recent experiments (Karlgren and
Cutting, 1994; Karlgren, 1996; Karlgren and Straszheim,
1997) indicate that this is not only feasible but can be
quite helpful. Someresults from these experiments will be
presented and discussed at the symposium.
The ecology of texts (Walker 1991) can be as important
to users as the content and genre. Whoreads a document
is an important factor in deciding which to choose among
several candidates. Social filtering, the techniqueof using
text ecology or text usage as a discriminant, is easily
applied to areas as diverse as Usenet Newstexts, Usenet
News conferences, music albums, video movies, and
novels (Karlgren, 1990; Karlgren, 1994).
Subtopic
Retrieval
Most IR systems treat texts as homogeneous units,
disregarding factors like length and internal structuring.
Oneway of compensatingfor this is to try to divide longer
texts into shorter segments. This segmentation becomes
really interesting whenit is not done mechanicallybut on
the basis of results from somekind of content analysis,
e.g., subtopic structure. Workalong these lines has been
carried out on English texts (Hearst 1997), and we will
report on our experiences of applying the same methodsto
38
Swedish texts. Such methods have already been tried for
finding relevant stretches in longer texts and for
summarization.In a multilingual search situation, it will
also be necessary to first identify the languageused in the
found documents by means of some algorithm (Kikui
1996).
Crude Translations
on Demand
In connection with cross-language retrieval from a small
language such as Swedish, it becomes increasingly
importantto be able to translate in both directions, i.e., to
first translate the search query into somemajor languages
(possibly to be specified by the user) besides Swedishand
then handle the result set in an intelligent way. The
possibility of online, full translations of entire documents
is not to be expected and arguably not even desired. The
user will probably first want to have a look at some
translated descriptors, e.g., index terms, central sentences
or a short summarization, before deciding whether a
certain text is relevant. This gives us another reason to
examine the techniques for subtopic analysis in some
detail, viz., to automatically pick short and relevant
subtexts to makecrude translations of, in order to help the
user judge the relevance of the entire document.
Crude translations of this modestkind, vectored towards
a specific domain, presented to a knowledgeableuser, and
with a lowlevel of textual ambition, are quite feasible with
the technology at hand today. At present, however, we do
not even knowwhat a user wants, needs and can handle in
a cross languagesearch situation.
Situation-Specific
Aspects
In the field of information retrieval the research questions
have been admirably well reduced to their cleanest
possible form. This has helped establish a well-founded
body of research, but at the same time left several
important factors unstudied. User background, user
motivation, and situation specific factors have largely been
ignored in mainstream information retrieval research: the
idealized picture of an information retrieval session is that
the user needs can be reduced to a simple verbal query of
relatively short length.
Knowledgeabout people’s spontaneous search strategies
tells us that the picture is a lot more complex(Walker,
1991; Belkin, 1994; HOOk
et al., 1995; Hth3k, 1996). With
the increased availability
of electronically
stored
information there is an attendant influx of newusers with
various backgrounds and various and perhaps unforeseen
information needs. Sometimespeople do look for specific
documentsor for answers to specific questions, but quite
as often they do not know what their information need
really is and they maynot even have enough knowledgeto
formulate a first question that might lead them further.
Sometimes the user wants a whole set of documents to
cover the area of interest, sometimesone representative
document is enough. As the amount of available
information and the number of users increase, various
kinds of adaptations of the search process and of the
presentation of the retrieved set of documents becomes
more and more urgent. All this points to the necessity of
careful empirical studies of user needs and user strategies
and we intend to carry out such studies in parallel with the
developmentof the system (Hansen and Karlgren, 1996).
At this symposiumwe will present experiments with
seemingly language independent or language tunable
techniquesfromthe fields of stylistic analysis and subtopic
segmentation, and discuss the question of whether they are
really independent or in fact makeimplicit assumptionsof
the source language.
Karlgren, J. 1994. "NewsgroupClustering Based On User
Behavior - A RecommendationAlgebra", SICS Technical
Report T94:04, Swedish Institute of Computer Science,
Stockholm.
Karlgren, J. 1990. "An Algebra for Recommendations",
Syslab Working Paper 179, Department of Computer and
System Sciences, Stockholm. Stockholm University.
Karlgren, J., and Straszheim, T. 1997. "Visualizing
Stylistic Variation." In the Proceedingsof the 30th Hawaii
International Conference on Systems Sciences, January 710, 1997, Maui, Hawaii.
Karlgren, J.;
Genres with
Analysis". In
Computation
1g/9410008.)
and Cutting, D. 1994. "Recognizing Text
Simple Metrics Using Discriminant
Proceedings of COLING
94, Kyoto. (In the
and Language E-Print Archive: cmp-
References
Kikui, G. 1996. "Identifying the Coding System and
Language of On-Line Documents on the Internet".
In
Proceedings
of COLING-96, August 5-9, 1996.
Vo1.II:652-657., Copenhagen, Denmark.
Belkin, N. J. 1994. "Design Principles for Electronic
Textual Resources: Investigating
Users and Uses of
Scholarly Information". In Zampolli, A.; Calzolari, N.;
and Palmer, M. (eds.) Current Issues in Computational
Linguistics:
In Honour of Don Walker, Dordrecht:
Kluwer.
Hansen, P.; and Karlgren, J. 1996. "Designing Interactive
IR Combining Adaptation to User Tasks and Strategies
with Non-topic Document Analysis". In The DELOS
Kick-Off Workshop: Pre-Workshop Abstracts, SophiaAntipolis, France, March. ERCIM.
Hearst, M. A. 1997. "TextTiling: Segmenting Text into
Multi-Paragraph Subtopic Passages". Computational
Linguistics, March1997.
H66k, K. 1996. A Glass Box Approach to Adaptive
Hypermedia.Ph.D. thesis, Dept. of Computerand Systems
Sciences, StockholmUniversity,
H66k, K.; Karlgren, J.; and W~n,A. 1995. "A Glass Box
Intelligent Help Interface", Paper presented to the First
International workshopon Multi-modal Interfaces (IMMI1), Edinburgh.
Karlgren, J. 1996. "Stylistic Variation in an Information
Retrieval Experiment" In proceedings of The Second
International Conference on NewMethods in Language
Processing - NeMLaP2, Bilkent, September 1996.
Ankara: Bilkent University. (In the Computation and
Language B-Print Archive: cmp-lg/9608003.)
39
K/illgren, G. 1984. "Automatiskexcerpering av substantiv
ur 16pande text. Ett m6jligt hj~ilpmedel vid automatisk
indexering?" IRI-rapport
1984:1. Institutet
f6r
R/ittsinformatik, Stockholmsuniversitet.
Luhn, H. P. 1957. "A Statistical Approachto Mechanized
Encoding and Searching of Literary Information". IBM
Journal of Research and Development 1 (4) 309-317.
(Reprinted in H. P. Lnhn: Pioneer of Information Science,
selected works. Schultz, C. K. (ed.) 1968. NewYork:
Sparta.
Strzalkowski, T.; Guthrie, L.; Karlgren, J.; Leistensnider,
J.; Lin, F.; Perez-Carballo, J.; Straszheim, T.; Wang,J.;
and Wilding, J. 1996. "Natural Language Information
Retrieval: TREC-5Report" In Proceedings of the fifth
Text Retrieval Conference, Harman, D. (ed.), NIST
Special Publication, Gaithersburg: NIST.
Walker, D. E. 1991. "The Ecology of Language". In
Proceedings of the International WorkshopOn Electronic
Dictionaries.
Japan Electronic Dictionary Research
Institute. Also in Zampolli, A.; Calzolari, N.; and Palmer,
M. (eds.).
1994. Current Issues in Computational
Linguistics:
In Honour of Don Walker, Dordrecht:
Kluwer.
Download