From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Multilingual functionality in the TwentyOne project

Wessel Kraaij
Netherlands Organization for Applied Scientific Research (TNO)
Institute of Applied Physics
PO Box 155
2600 AD Delft
The Netherlands
kraaij@tpd.tno.nl or kraaijw@acm.org

Abstract

TwentyOne is an EU-funded project which aims at developing advanced indexing and retrieval techniques for multimedia document bases. The document base consists of documents in four languages: Dutch, English, French and German. This paper focusses on the multilingual aspects of the project: cross-language retrieval, partial document translation techniques and automatic hyperlinking between source text and translations.

Introduction

TwentyOne [1][2] is a project funded by the EU Telematics Application Programme. Project partners include academic partners like the Universities of Twente and Tübingen, companies like Getronics and Xerox, contract research organisations like TNO and DFKI, and non-profit environmental organisations like Friends of the Earth. The project can be characterised by the following keywords:

Sustainable development. The name of the project refers to the UN conference on this topic in Rio de Janeiro in 1992. The aim of the project is to build a system that supports and improves dissemination of information about 'local agenda 21' initiatives.

Document conversion. The TwentyOne system aims at the disclosure of documents of different media types and/or data formats, e.g. paper documents, WEB documents, word processor documents, text annotated images, audio or video material.

Knowledge based disclosure. The TwentyOne multimedia document base will be disclosed using several advanced techniques like fuzzy matching, NLP-based phrase indexing, relevance ranking and automatic hyperlinking.

Multilinguality. The TwentyOne database consists of documents in different languages, initially Dutch, English, French and German, but extensions to other European languages are envisaged.
Dissemination model. The environmental partners develop an information transaction model which works like a perpetuum mobile. Both information providers and seekers profit from the model, the former by increasing the number of potential customers, the latter because more information becomes available. The project tries to stimulate interaction and raise awareness of local agenda 21 initiatives in Europe.

Application oriented. The most important deliverable of the project is the profiling system which produces an index on the multilingual multimedia document base. This index will be available via CD-ROM and accessible via a WEB server.

Because TwentyOne is funded by an application oriented programme, the project has only limited resources for fundamental research. Some aspects of the system touch open research problems though. The consortium has assessed the state of the art of technology in these areas (e.g. cross-language information retrieval). Because a first version of the system has to be finished by the end of 1997, we take a pragmatic approach, integrating available tools and resources and developing solutions for missing links.

[1] This paper describes joint work with colleagues at TNO-TPD, University of Twente, DFKI, Xerox and University of Tübingen.
[2] The TwentyOne homepage can be found at: http://www.tno.nl/

This paper focusses on the multilingual functions of TwentyOne. They are threefold:

1. Retrieval of documents in a language other than the query language (CLIR); supported languages are Dutch, English, French and German
2. (Partial) translation of documents to enable content judgement by the user
3. Automatic hyperlinks between index terms and their translations (aligned multilingual documents)

From a research perspective, attacking four languages at once complicates things considerably. Scalability of the system and separation of language dependent from language independent resources become more important than in the two-language case, which has been investigated in detail, especially in the last few years.
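To make the four-language complication concrete, the following minimal sketch counts the translation directions that need transfer resources. It is only an illustration: the function names are hypothetical, and the pivot variant (routing every translation through one language, here Dutch) is an assumption made for comparison, not a design stated for the project.

```python
from itertools import permutations

LANGUAGES = ["Dutch", "English", "French", "German"]  # the four TwentyOne languages


def directed_pairs(languages):
    """All (source, target) translation directions between distinct languages."""
    return list(permutations(languages, 2))


def pivot_mappings(languages, pivot):
    """Directions needed when every translation goes through one pivot language."""
    others = [lang for lang in languages if lang != pivot]
    return [(lang, pivot) for lang in others] + [(pivot, lang) for lang in others]


pairs_two = directed_pairs(["Dutch", "English"])
pairs_four = directed_pairs(LANGUAGES)
via_dutch = pivot_mappings(LANGUAGES, "Dutch")

print(len(pairs_two), len(pairs_four), len(via_dutch))  # prints: 2 12 6
```

Going from two to four languages raises the number of directed language pairs from 2 to 12, which is why language independent components that can be shared across pairs matter more here than in the two-language case.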
A comparable predecessor is the ESPRIT II project EMIR (Fluhr & Radwan 1993), which covers a subset of the TwentyOne languages: English, French and German. EMIR is based on the SPIRIT ranked Boolean engine combined with a multilingual thesaurus as front-end. EMIR is currently being extended to Russian. Other, more recent, initiatives with a comparable objective are:

1. MULINEX (Erbach 1997): A multilingual search engine for German, French and English
2. TITAN (Hayashi 1997): A search engine to search in English Web pages with a Japanese query
3. MUNDIAL (Davis & Ogden 1997): A search engine to search in Spanish Web pages with an English query

It is obvious that development of full document translation software is far beyond the scope of the project. Therefore we have planned to evaluate available commercial systems and develop supplementary shallow term translation modules where needed (i.e. for missing language pairs).

The automatic hyperlinking function attaches typed hyperlinks between terms, phrases or images etc. These links can be either static (generated off-line) or dynamic, in which case a link is evaluated by a CGI program. We have planned to generate hyperlinks for all translated noun phrases, which makes it easy for the user to jump between translated and original text.

In this paper we will concentrate on CLIR and partial document translation, because these functions can be combined in several aspects. We will first present some results from CLIR experiments which have inspired the design. Subsequently we will discuss the TwentyOne approach to CLIR and document translation, which is also influenced by the availability of linguistic resources like bilingual dictionaries.

Concise overview of approaches to CLIR

We will present the possibilities for CLIR in a slightly different taxonomy [3] than the one used in the overview article by Oard (Oard & Dorr 1996). CLIR systems can be classified in two ways:

1. The stage in the disclosure process at which the language transfer takes place. Translation can be done either during indexing time (off-line) or as a pre-processing step of the retrieval process (on-line).
2. The translation process can be based on three sources of transfer knowledge:
   (a) MT systems
   (b) Bilingual dictionaries or thesauri
   (c) Parallel corpora

We will discuss all possible combinations of the approaches and resources.

[3] This taxonomy highlights the first dilemma in CLIR systems design: either on-line query translation or off-line document translation.

Query translation (on-line translation)

Dictionary based approach. Simple word-by-word translation of the query terms has been evaluated in e.g. (Hull & Grefenstette 1996). It is the simplest approach to CLIR, as ambiguity is left unresolved: each (lemmatised) word is substituted by all its possible translations. Two problems are prominent:

1. Polysemy: Translation of query concepts is likely to decrease precision when the word sense cannot be disambiguated. Example: the Dutch word "slag" can be translated as either "battle" or "stroke". On the other hand, if more than one equivalent translation is available, translation could increase recall, because synonyms are added to the query. Hull proposes to use a ranked Boolean query model as a possible way to cope with this problem. In this model documents are ordered by the number of (translations of) query concepts that are matched. This model will probably not work so well for short (1-3 term) queries, because documents that match only one query concept have a high probability of being totally off topic when that query term has multiple translations. In a more abstract sense, the document database itself is used as a disambiguating filter, but the window size of the filter is rather large, i.e. the full size of a document. This method could probably be enhanced by restricting the window size, which would require storing position information of each word in the index.

2. Multi-word expressions (MWEs): Idiomatic expressions, terminology and collocations are a notorious problem in CLIR. Word-based translation fails here because often the meaning of the MWE is not compositional, e.g. yellow pages. A terminology or idiomatic dictionary can only partly alleviate the problem because most MWEs are highly domain specific.

MT based approach. Typical queries in current popular IR systems like Web search engines tend to be very short. Therefore the advantage of MT systems (which in principle can exploit syntactic and semantic aspects of context to improve translation) with respect to dictionary based approaches is questionable. On the other hand, for longer queries (query by example, search for similar documents) MT could yield good results. The EMIR project has compared SYSTRAN query translation with thesaurus based translation; the average precision of the latter system turned out to be much better.

Corpus based approach. Parallel corpora implicitly encode a lot of transfer knowledge. This knowledge can be exploited in different ways:

1. Deriving bilingual dictionaries from aligned corpora. Domain specific aligned corpora in particular are of great value to infer translations of, or at least identify, MWEs. These are of key value for CLIR but cannot be dealt with by simple word-based translation. In fact this is also a dictionary based approach.

2. Store dual-language documents in a dual-language vector space: perform Latent Semantic Indexing on the dual-language documents before folding in the monolingual documents. The LSI space captures a "multi-lingual semantic space" onto which the monolingual documents are mapped. Positive results are reported in (Dumais, Landauer, & Littman 1996). An advantage of this approach is that alignment of the parallel corpora is only necessary on the document level.

Document translation (off-line)

MT based full translation. If we translate all documents to the query language, then CLIR is reduced to a monolingual IR case. A disadvantage of the approach is the dependency on imperfect MT systems, which are often closed monolithic systems with (probably) limited coverage of domain terminology. Another disadvantage is that MT systems deliver only one translation in case of synonyms. An advantage, however, is that the translated documents can also be used for presentation to the user, which makes sense when translating from languages of which the user does not even have passive knowledge. Machine translation of complete documents is obviously more worthwhile than translating short queries, because the MT system can use the whole document as context. Dumais (Dumais, Landauer, & Littman 1996) reported favourable results of document translation by SYSTRAN in combination with monolingual LSI.

Partial translation techniques. Because most indexing models are based on lemmatised content words, a CLIR system could be based on lemma based translation of non-stopwords as a front end for a monolingual system. However, this transfer step is hampered by the same problems as dictionary based query translation. The main difference with query translation is the availability of context. The question is how to use this context to improve the translation. We will sketch a possible approach in a later section. Partial translation of noun phrases for presentation purposes has to meet higher requirements than the query translation case: getting the word senses right is not enough, because word order and inflection have to be correct in order to make the translation readable. This step requires syntactic and morphological knowledge.

The TwentyOne approach

In this section we will discuss the design choices we have made in order to build a system with the three multilingual aspects which were introduced earlier. We will start by listing the relevant resources.

Availability of Resources

Bi- or multilingual dictionaries. We have contacts with two Dutch publishers. The material is either a collection of bilingual dictionaries from Dutch to the other languages or a multilingual thesaurus, including morphological information. The lexical database even contains translations of idioms and collocations, which might be extremely valuable. We do not know of MRDs of publishers in other European countries which offer translation to and from Dutch.

EU materials. The EU has published the EUROVOC thesaurus, a collection of commonly used terminology in EU documents. The thesaurus is electronically available.

Parallel texts. We have the official "Agenda 21" conference document in all four languages. We are still trying to find parallel texts at EU or UN institutions.

Commercial MT software. Recently a survey of these tools (examples can be found on the WEB) has been started at DFKI. We are not aware of a commercially available MT system which supports the four languages supported by TwentyOne.

Monolingual IR system. TwentyOne will use the monolingual IR kernel of TNO-TPD, which supports:
• Vector space retrieval
• Boolean retrieval
• Fuzzy matching

NLP tools. Xerox provides their finite state tools for morphological analysis and POS disambiguation. In addition, there is a fast PSG parser developed at TNO-TPD for NP extraction.

Document translation in TwentyOne

Experiments with word based translation and translation by Systran via the WWW have shown the enormous difference in quality between these approaches. Therefore we will store translations of the documents at the TwentyOne site for the purpose of presentation. We know already, however, that not all language pairs are covered by commercial MT tools, so a fall-back option is needed. The fall-back option is called term translation. With term we refer to the main indexing units of the TwentyOne system: noun phrases. This means that in most cases a term is complex, i.e. consists of more than one concept. The challenge is to develop robust term translation techniques.
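The fall-back decision and the reason complex terms are hard to translate can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TwentyOne implementation: the set of MT-covered language pairs, the two-entry lexicon and both function names are hypothetical. Only the ambiguity of the Dutch word "slag" ("battle" or "stroke") is taken from the paper.

```python
from itertools import product

# Hypothetical coverage of a commercial MT tool, as (source, target) pairs.
MT_COVERED = {
    ("English", "French"), ("French", "English"),
    ("English", "German"), ("German", "English"),
}

# Toy Dutch-English word lexicon (assumed entries, for illustration only).
LEXICON = {
    "slag": ["battle", "stroke"],
    "zware": ["heavy", "severe"],
}


def choose_strategy(source, target):
    """Use full MT when the pair is covered, otherwise fall back to term translation."""
    return "full_mt" if (source, target) in MT_COVERED else "term_translation"


def candidate_translations(term_words):
    """Word-by-word candidates for a noun-phrase term, keeping the source word order.

    Words missing from the lexicon are passed through unchanged.
    """
    options = [LEXICON.get(word, [word]) for word in term_words]
    return [" ".join(choice) for choice in product(*options)]


strategy = choose_strategy("Dutch", "English")
candidates = candidate_translations(["zware", "slag"])

print(strategy)    # term_translation
print(candidates)  # ['heavy battle', 'heavy stroke', 'severe battle', 'severe stroke']
```

Even a two-word term with two senses per word produces four candidates, and the count multiplies with every additional ambiguous word, so some way of selecting among candidates is needed.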
The crucial part will be sense disambiguation. Our hypothesis is that sense disambiguation is more precise in the 'document translation' context than in the 'query translation' context. In the DT case we can exploit the context.

Context Sensitive Term Translation (CSTT). The envisaged CSTT module is based on two kinds of lexical resources:

1. General purpose machine readable dictionaries
2. Domain specific (multi-word-term) lexica, based on term alignment from parallel corpora and manual translations of key terms in the domain.

When a phrase is not found in the domain specific term lexicon, the CSTT will revert to a word-by-word translation. This process yields a number of possible translations for each word in the phrase, corresponding to a large number of candidate translations of the phrase. We propose to filter out the best translations by a combination of techniques:

1. Demoting candidate phrases which do not occur in the document base, cf. (Radwan & Fluhr 1995)
2. Exploiting morphosyntactic rules describing the translation and formation process of NPs, cf. (Jacquemin 1995)
3. Using co-occurrence information of word senses with context words (Schuetze & Pedersen 1995)
4. Keeping consistent with previous translations of the same term within the same text section.

Translation Hyperlinks. A second reason why we want to develop our own term translation methodology is that we want to establish hyperlinks between terms and their translations. The result is a document aligned with its three translations. The alignment between terms will be implemented by hyperlinks. MT systems are file oriented and thus would require post-translation alignment (reverse engineering).

CLIR in TwentyOne

Figure 1 shows that the TwentyOne system will include both document translation and query translation, because we expect that both approaches can improve the performance of the system in their own way. The main approach to CLIR is document translation, because DT can fully exploit context for disambiguation. But we expect the following problems:

1. OCR errors will not be translated
2. Part of the domain specific terminology is not covered by the available transfer resources
3. Some language pairs might be stuck with poor DT functionality

Query translation can partly alleviate the effects of these problems in the following ways:

1. A document with a relevant term which contains an OCR error can be found via fuzzy matching with the translated query concept.
2. The user can perform relevance feedback in the target language, once a relevant document is found in the particular foreign language. This technique is also useful to overcome the effects of translation ambiguity.
3. A word based translation approach followed by a ranked Boolean query (cf. (Hull & Grefenstette 1996)) can act as a disambiguating filter.
4. Interactive disambiguation by the user

Query translation in TwentyOne will use a multilingual lexicon which comprises both lemmas (including syntactic category) and multi-word expressions. This lexicon will be based on the merge of existing multilingual thesauri, bilingual machine readable dictionaries and dictionaries derived from parallel corpora (Hiemstra 1996), and probably also some hand-coded translations for automatically identified MWEs.

[Figure 1: diagram not recoverable from the source; surviving labels: Original Dutch documents, English to Dutch translated docs, Dutch to English translated docs, Original English documents]

Scalability and Trade-offs

The choice for "document translation" is not very attractive from the perspective of scalability. Each extra language requires an extra copy of the documents and an extra index. There is however one pragmatic advantage: it is possible to produce language specific CD-ROM versions which do not require the (expensive) multilingual dictionary. One possibility to reduce the amount of disk storage needed for translations is to do document translation on the fly. Current translation services are still a bit slow, so gisting (Resnik 1997) or gloss translations could be an attractive compromise. We envisage different variants of the TwentyOne system, either based on document translation (for small constrained domains) or on query translation (for larger document bases).

Another scalability aspect is the necessity to work via an interlingua when more languages are added. In practice, this can pose a problem in the case of integrating bilingual dictionaries from different sources which have a different interlingua.

Outlook

In the project there is some time available for evaluation. The evaluation will be based both on feedback from "real" users, because the system will be operational on the WEB during the project, and on a small scale test with the usual measures like average precision, probably in the context of the multilingual track of TREC7.

References

Davis, M., and Ogden, W. 1997. Implementing cross-language text retrieval systems for large-scale text collections and the world wide web. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Dumais, S. T.; Landauer, T. K.; and Littman, M. L. 1996. Automatic cross-linguistic information retrieval using latent semantic indexing. In Workshop on Cross-Linguistic Information Retrieval (SIGIR'96), 16-24.
Erbach, G. 1997. Mulinex: Multilingual indexing, navigation and editing extensions for the world-wide web. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Fluhr, C., and Radwan, K. 1993. Fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation. In EWAIC'93.
Radwan, K., and Fluhr, C. 1995. Textual database lexicon used as a filter to resolve semantic ambiguity, application on multilingual information retrieval. In Fourth Annual Symposium on Document Analysis and Information Retrieval.
Hayashi, Y. 1997.
Titan: A cross-linguistic search engine for the www. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Hiemstra, D. 1996. Using statistical methods to create a bilingual dictionary. Master's thesis, University of Twente.
Hull, D., and Grefenstette, G. 1996. A dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval.
Jacquemin, C. 1995. A symbolic and surgical acquisition of terms through variation. In Proceedings of the IJCAI'95 workshop: New Approaches to Learning for NLP.
Johansson, C. 1996. Good bigrams. In Proceedings of COLING 1996, 592-597.
Oard, D. W., and Dorr, B. J. 1996. A survey of multilingual text retrieval. Technical report, University of Maryland.
Resnik, P. 1997. Evaluating multilingual gisting of web pages. In Working Notes of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Schuetze, H., and Pedersen, J. O. 1995. Information retrieval based on word senses. In Fourth Annual Symposium on Document Analysis and Information Retrieval.