Multilingual functionality in the TwentyOne project

From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Wessel Kraaij
Netherlands Organization for Applied Scientific Research (TNO)
Institute of Applied Physics
PO Box 155, 2600 AD Delft
The Netherlands
kraaij@tpd.tno.nl or kraaijw@acm.org
Abstract
TwentyOne is an EU-funded project which aims at developing advanced indexing and retrieval techniques for multimedia document bases. The document base consists of documents in four languages: Dutch, English, French and German. This paper focusses on the multilingual aspects of the project: cross-language retrieval, partial document translation techniques and automatic hyperlinking between source text and translations.
Introduction
TwentyOne¹ ² is a project funded by the EU Telematics Application Programme. Project partners include academic partners like the Universities of Twente and Tübingen, companies like Getronics and Xerox, contract research organisations like TNO and DFKI, and non-profit environmental organisations like Friends of the Earth. The project can be characterised by the following keywords:
Sustainable Development. The name of the project refers to the UN conference on this topic in Rio de Janeiro in 1992. The aim of the project is to build a system that supports and improves dissemination of information about 'local agenda 21' initiatives.
Document conversion. The TwentyOne system aims at the disclosure of documents of different media types and/or data formats, e.g. paper documents, Web documents, word processor documents, text-annotated images, audio or video material.
Knowledge based disclosure. The TwentyOne multimedia document base will be disclosed using several advanced techniques like fuzzy matching, NLP-based phrase indexing, relevance ranking and automatic hyperlinking.
Multilinguality. The TwentyOne database consists of documents in different languages, initially Dutch, English, French and German, but extensions to other European languages are envisaged.
Dissemination model. The environmental partners develop an information transaction model which works like a perpetuum mobile. Both information providers and seekers profit from the model, the former by increasing the number of potential customers, the latter because more information becomes available. The project tries to stimulate interaction and raise awareness of local agenda 21 initiatives in Europe.
Application oriented. The most important deliverable of the project is the profiling system which produces an index on the multilingual multimedia document base. This index will be available via CD-ROM and accessible via a Web server.
Because TwentyOne is funded by an application oriented programme, the project has only limited resources for fundamental research. Some aspects of the system touch open research problems though. The consortium has assessed the state of the art of technology in these areas (e.g. cross-language information retrieval). Because a first version of the system has to be finished by the end of 1997, we take a pragmatic approach, by integrating available tools and resources and developing solutions for missing links.
This paper focusses on the multilingual functions of TwentyOne. They are threefold:
1. Retrieval of documents in another language than the query language (CLIR); supported languages are Dutch, English, French and German
2. (Partial) translation of documents to enable content judgement by the user
3. Automatic hyperlinks between index terms and their translations (aligned multilingual documents)
¹ This paper describes joint work with colleagues at TNO-TPD, University of Twente, DFKI, Xerox and University of Tübingen.
² The TwentyOne homepage can be found at: http://www.tno.nl/
From a research perspective, attacking four languages at once complicates things considerably. Scalability of the system and separation of language dependent from language independent resources become more important than in the two-language case, which has been investigated in detail, especially in the last few years. A comparable predecessor is the ESPRIT II project EMIR (Fluhr & Radwan 1993), which covers a subset of the TwentyOne languages: English, French and German. EMIR is based on the SPIRIT ranked boolean engine combined with a multilingual thesaurus as front-end. EMIR is currently being extended to Russian. Other, more recent, initiatives with a comparable objective are:
1. MULINEX (Erbach 1997): a multilingual search engine for German, French and English
2. TITAN (Hayashi 1997): a search engine to search in English Web pages with a Japanese query
3. MUNDIAL (Davis & Ogden 1997): a search engine to search in Spanish Web pages with an English query
It is obvious that development of full document translation software is far beyond the scope of the project. Therefore we have planned to evaluate available commercial systems and develop supplementary shallow term translation modules where needed (i.e. for missing language pairs).
The automatic hyperlinking function attaches typed hyperlinks between terms, phrases or images etc. These links can be either static (generated off-line) or dynamic, in which case a link is evaluated by a CGI program. We have planned to generate hyperlinks for all translated noun phrases, which makes it easy for the user to jump between translated and original text.
In this paper we will concentrate on CLIR and partial document translation, because these functions can be combined in several aspects. We will first present some results from CLIR experiments which have inspired the design. Subsequently we will discuss the TwentyOne approach to CLIR and document translation, which is also influenced by the availability of linguistic resources like bilingual dictionaries.
Concise overview of approaches to CLIR
We will present the possibilities for CLIR in a slightly different taxonomy³ than the one used in the overview article by Oard (Oard & Dorr 1996). CLIR systems can be classified in two ways:
1. The stage in the disclosure process at which the language transfer takes place. Translation can be done either during indexing time (off-line) or as a pre-processing step in the retrieval process (on-line).
2. The translation process can be based on three sources of transfer knowledge:
(a) MT systems
(b) Bilingual dictionaries or thesauri
(c) Parallel corpora
We will discuss all possible combinations of the approaches and resources.
³ This taxonomy highlights the first dilemma in CLIR systems design: either on-line query translation or off-line document translation.
Query translation (on-line translation)
Dictionary based approach. Simple word by word translation of the query terms has been evaluated in e.g. (Hull & Grefenstette 1996). It is the simplest approach to CLIR, as ambiguity is left unresolved: each (lemmatised) word is substituted by all its possible translations. Two problems are prominent:
1. Polysemy: Translation of query concepts is likely to decrease precision when the word sense cannot be disambiguated. Example: the Dutch word "slag" can be translated to both "battle" and "stroke". On the other hand, if more than one equivalent translation is available, translation could increase recall, because synonyms are added to the query. Hull proposes to use a ranked Boolean query model as a possible way to cope with this problem. In this model documents are ordered on the number of (translations of) query concepts that are matched. This model will probably not work so well for short (1-3 term) queries, because documents that match only one query concept have a high probability of being totally off topic when that query term has multiple translations. In a more abstract sense, the document database itself is used as a disambiguating filter, but the window size of the filter is rather large, i.e. the full size of a document. This method could probably be enhanced by restricting the window size, which would require storing position information of each word in the index.
2. Multi-word expressions (MWEs): Idiomatic expressions, terminology and collocations are a notorious problem in CLIR. Word based translation fails here because often the meaning of the MWE is not compositional, e.g. yellow pages. A terminology or idiomatic dictionary can only partly alleviate the problem because most of the MWEs are highly domain specific.
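The dictionary based approach and the ranked Boolean ordering described above can be illustrated with a minimal sketch. The toy dictionary, the documents and the function names are hypothetical and only serve the example; this is not the implementation evaluated by Hull & Grefenstette or the one used in TwentyOne.

```python
# Hypothetical toy bilingual dictionary (Dutch -> English); entries are illustrative only.
DICTIONARY = {
    "slag": ["battle", "stroke"],
    "veld": ["field"],
}

# Hypothetical target-language document base.
DOCUMENTS = {
    "d1": "the battle took place on an open field",
    "d2": "the patient recovered from a stroke",
    "d3": "a field of wheat",
}


def translate_query(lemmas, dictionary):
    """Replace each lemma by the set of all its translations (no disambiguation)."""
    concepts = []
    for lemma in lemmas:
        # Words missing from the dictionary are kept untranslated.
        concepts.append(set(dictionary.get(lemma, [lemma])))
    return concepts


def ranked_boolean(concepts, documents):
    """Order documents by the number of query concepts matched by any translation."""
    scored = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        matched = sum(1 for concept in concepts if concept & tokens)
        if matched:
            scored.append((doc_id, matched))
    # Most matched concepts first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


two_term_ranking = ranked_boolean(translate_query(["slag", "veld"], DICTIONARY), DOCUMENTS)
one_term_ranking = ranked_boolean(translate_query(["slag"], DICTIONARY), DOCUMENTS)
```

With the two-term query, the document matching both concepts is ranked first, so the second concept acts as a disambiguating filter. With the one-term query, the "battle" and "stroke" documents receive the same score, which illustrates why the model is expected to work less well for very short queries.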
MT based approach. Typical queries in current popular IR systems like Web search engines tend to be very short. Therefore the advantage of MT systems (which in principle can exploit syntactic and semantic aspects of context to improve translation) with respect to dictionary based approaches is questionable. On the other hand, for longer queries (query by example, search for similar documents) MT could yield good results. The EMIR project has compared SYSTRAN query translation with thesaurus based translation; the average precision of the latter system turned out to be much better.
Corpus based approach. Parallel corpora implicitly encode a lot of transfer knowledge. This knowledge can be exploited in different ways:
1. Deriving bilingual dictionaries from aligned corpora. Especially domain specific aligned corpora are of great value to infer translations of, or at least identify, MWEs. These are of key value for CLIR but can't be dealt with by simple word-based translation. In fact this is also a dictionary based approach.
2. Storing dual-language documents in a dual-language vector space and performing Latent Semantic Indexing (LSI) on the dual-language documents before folding in the monolingual documents. The LSI space captures a "multi-lingual semantic space" on which the monolingual documents are mapped. Positive results are reported in (Dumais, Landauer, & Littman 1996). An advantage of this approach is that alignment of the parallel corpora is only necessary on the document level.
Document translation (off-line)
MT based full translation. If we translate all documents to the query language, then CLIR is reduced to a monolingual IR case. A disadvantage of the approach is the dependency on imperfect MT systems, which are often closed monolithic systems with (probably) limited coverage of domain terminology. Another disadvantage is that MT systems deliver only one translation in case of synonyms. An advantage however is that the translated documents can also be used for presentation to the user, which makes sense when translating from languages of which the user does not even have passive knowledge. Machine translation of complete documents is obviously more worthwhile than translating short queries, because the MT system can use the whole document as context. Dumais (Dumais, Landauer, & Littman 1996) reported favourable results of document translation by SYSTRAN in combination with monolingual LSI.
Partial translation techniques. Because most indexing models are based on lemmatised content words, a CLIR system could be based on lemma based translation of non-stopwords as a front end for a monolingual system. However, this transfer step is hampered by the same problems as dictionary based query translation. The main difference with query translation is the availability of context. The question is how to use this context to improve the translation. We will sketch a possible approach in a later section. Partial translation of noun phrases for presentation purposes has to meet higher requirements than the query translation case: getting the word senses right is not enough, because word order and inflection have to be correct in order to make the translation readable. This step requires syntactic and morphological knowledge.
The TwentyOne approach
In this section we will discuss the design choices we have made in order to build a system with the three multilingual aspects which were introduced earlier. We will start by listing the relevant resources.
Availability of Resources
Bi- or multilingual dictionaries. We have contacts with two Dutch publishers. The material is either a collection of bilingual dictionaries from Dutch to the other languages or a multilingual thesaurus, including morphological information. The lexical database even contains translations of idioms and collocations, which might be extremely valuable. We don't know about MRDs of publishers in other European countries which offer translation to and from Dutch.
EU materials. The EU has published the EUROVOC thesaurus, a collection of commonly used terminology in EU documents. The thesaurus is electronically available.
Parallel texts. We have the official "Agenda 21" conference document in all the four languages. We are still trying to find parallel texts at EU or UN institutions.
Commercial MT software. Recently a survey of these tools (examples can be found on the Web) has been started at DFKI. We are not aware of a commercially available MT system which supports the four languages supported by TwentyOne.
Monolingual IR system. TwentyOne will use the monolingual IR kernel of TNO-TPD which supports:
• Vector space retrieval
• Boolean retrieval
• Fuzzy matching
NLP tools. Xerox provides their finite state tools for morphological analysis and POS disambiguation. A fast PSG parser for NP extraction has been developed at TNO-TPD.
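Of the retrieval modes listed for the IR kernel, fuzzy matching is perhaps the least self-explanatory. The paper does not say how the TNO-TPD kernel implements it; the sketch below assumes a common technique, the Dice coefficient over character trigrams, purely as an illustration. All names and the threshold value are hypothetical.

```python
def trigrams(word):
    """Return the set of character trigrams of a word, padded with boundary blanks."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def dice(a, b):
    """Dice coefficient between the trigram sets of two strings (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def fuzzy_lookup(term, vocabulary, threshold=0.5):
    """Return index terms whose similarity to `term` reaches the threshold, best first."""
    scored = sorted(((dice(term, entry), entry) for entry in vocabulary),
                    key=lambda pair: (-pair[0], pair[1]))
    return [entry for score, entry in scored if score >= threshold]


# Hypothetical index vocabulary containing a word form damaged by OCR ("0" for "o").
VOCABULARY = ["environment", "envir0nment", "government"]
matches = fuzzy_lookup("environment", VOCABULARY)
```

Under these assumptions a query term still retrieves a slightly corrupted index term, while an unrelated word that merely shares a suffix stays below the threshold.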
Document translation in TwentyOne
Experiments with word based translation and translation by Systran via the WWW have shown the enormous difference in quality between these approaches. Therefore we will store translations of the documents at the TwentyOne site for the purpose of presentation. We know already, however, that not all language pairs are covered by commercial MT tools, so a fall-back option is needed.
The fall-back option is called term translation. With term we refer to the main indexing units of the TwentyOne system: noun phrases. This means that in most cases a term is complex, i.e. consists of more than one concept. The challenge is to develop robust term translation techniques. The crucial part will be sense disambiguation. Our hypothesis is that sense disambiguation is more precise in the 'document translation' (DT) context than in the 'query translation' context. In the DT case we can exploit the context.
Context Sensitive Term Translation (CSTT). The envisaged CSTT module is based on two kinds of lexical resources:
1. General purpose machine readable dictionaries
2. Domain specific (multi-word-term) lexica, based on term alignment from parallel corpora and manual translations of key terms in the domain.
When a phrase is not found in the domain specific term lexicon, the CSTT will revert to a word by word translation. This process yields a number of possible translations for each word in the phrase, corresponding to a large number of candidate translations of the phrase. We propose to filter out the best translations by a combination of techniques:
1. Demoting candidate phrases which do not occur in the document base, cf. (Radwan & Fluhr 1995)
2. Exploiting morphosyntactic rules describing the translation and formation process of NPs, cf. (Jacquemin 1995)
3. Using co-occurrence information of word senses with context words, cf. (Schuetze & Pedersen 1995)
4. Keeping consistent with previous translations of the same term within the same text section.
Translation Hyperlinks
A second reason why we want to develop our own term translation methodology is that we want to establish hyperlinks between terms and their translations. The result is a document aligned with its three translations. The alignment between terms will be implemented by hyperlinks. MT systems are file oriented and thus would require post-translation alignment (reverse engineering).
CLIR in TwentyOne
Figure 1 shows that the TwentyOne system will include both document translation and query translation, because we expect that both approaches can improve the performance of the system in their own way. The main approach to CLIR is document translation, because DT can fully exploit context for disambiguation. But we expect the following problems:
1. OCR errors will not be translated
2. Part of the domain specific terminology is not covered by the available transfer resources
3. Some language pairs might be stuck with poor DT functionality
Query translation can partly alleviate the effects of these problems in the following ways:
1. A document with a relevant term which contains an OCR error can be found via fuzzy matching with the translated query concept.
2. The user can perform relevance feedback in the target language, once a relevant document is found in the particular foreign language. This technique is also useful to overcome the effects of translation ambiguity.
3. A word based translation approach followed by a ranked boolean query (cf. (Hull & Grefenstette 1996)) can act as a disambiguating filter.
4. Interactive disambiguation by the user
Query translation in TwentyOne will use a multilingual lexicon which comprises both lemmas (including syntactic category) and multi-word expressions. This lexicon will be based on the merge of existing multilingual thesauri, bilingual machine readable dictionaries and dictionaries derived from parallel corpora (Hiemstra 1996), and probably also some hand-coded translations for automatically identified MWEs.
[Figure 1. Labels recoverable from the original diagram: Original Dutch documents; English to Dutch translated docs; Dutch to English translated docs; Original English documents.]
Scalability and Trade-offs
The choice for "document translation" is not very attractive from the perspective of scalability. Each extra language requires an extra copy of the documents and an extra index. There is however one pragmatic advantage: it is possible to produce language specific CD-ROM versions which do not require the (expensive) multilingual dictionary.
One possibility to reduce the amount of disk storage needed for translations is to do document translation on the fly. Current translation services are still a bit slow, so gisting (Resnik 1997) or gloss translations could be an attractive compromise.
We envisage different variants of the TwentyOne system, either based on document translation (for small constrained domains), or on query translation (for larger document bases).
Another scalability aspect is the necessity to work via an interlingua when more languages are added. In practice, this can pose a problem in the case of integrating bilingual dictionaries from different sources which have a different interlingua.
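Returning to the fall-back step of Context Sensitive Term Translation, the generation of candidate phrase translations and the first proposed filter (demoting candidates that do not occur in the document base) can be sketched as follows. This is a minimal illustration under stated assumptions: the word dictionary, the target documents and all function names are hypothetical, and the other three filtering techniques are not modelled.

```python
import re
from itertools import product

# Hypothetical word-level dictionary (Dutch -> English); entries are illustrative only.
WORD_DICTIONARY = {
    "duurzame": ["sustainable", "durable"],
    "ontwikkeling": ["development", "evolution"],
}

# Hypothetical target-language document base.
TARGET_DOCUMENTS = [
    "Agenda 21 promotes sustainable development in cities.",
    "Sustainable development requires local action.",
    "The evolution of durable goods markets.",
]


def normalise(text):
    """Lower-case the text and reduce it to word tokens separated by single blanks."""
    return " ".join(re.findall(r"\w+", text.lower()))


def candidate_translations(phrase, dictionary):
    """Combine every translation of every word into candidate phrase translations."""
    options = [dictionary.get(word, [word]) for word in normalise(phrase).split()]
    return [" ".join(choice) for choice in product(*options)]


def rank_candidates(candidates, target_documents):
    """Rank candidates by the number of target documents containing them as a phrase."""
    texts = [f" {normalise(doc)} " for doc in target_documents]
    counts = {c: sum(1 for text in texts if f" {c} " in text) for c in candidates}
    # Candidates absent from the document base end up at the bottom of the ranking.
    ranked = sorted(candidates, key=lambda c: (-counts[c], c))
    return ranked, counts


candidates = candidate_translations("duurzame ontwikkeling", WORD_DICTIONARY)
ranked, counts = rank_candidates(candidates, TARGET_DOCUMENTS)
```

Word by word translation yields four candidates here, of which only one is attested in the toy document base and is therefore ranked first. Word order and inflection are ignored in this sketch; the text assigns those to the morphosyntactic rules.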
Outlook
In the project there is some time available for evaluation. The evaluation will be based both on feedback from "real" users, because the system will be operational on the Web during the project, and on a small scale test with the usual measures like average precision, probably in the context of the multilingual track of TREC-7.
References
Davis, M., and Ogden, W. 1997. Implementing cross-language text retrieval systems for large-scale text collections and the world wide web. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Dumais, S. T.; Landauer, T. K.; and Littman, M. L. 1996. Automatic cross-linguistic information retrieval using latent semantic indexing. In Workshop on Cross-Linguistic Information Retrieval (SIGIR'96), 16-24.
Erbach, G. 1997. Mulinex: Multilingual indexing, navigation and editing extensions for the world-wide web. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Fluhr, C., and Radwan, K. 1993. Fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation. In EWAIC'93.
Hayashi, Y. 1997. Titan: A cross-linguistic search engine for the WWW. In Proceedings of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Hiemstra, D. 1996. Using statistical methods to create a bilingual dictionary. Master's thesis, University of Twente.
Hull, D., and Grefenstette, G. 1996. A dictionary-based approach to multilingual information retrieval. In Proceedings of the 19th ACM SIGIR Conference on Research and Development in Information Retrieval.
Jacquemin, C. 1995. A symbolic and surgical acquisition of terms through variation. In Proceedings of the IJCAI'95 workshop: New Approaches to Learning for NLP.
Johansson, C. 1996. Good bigrams. In Proceedings of COLING 1996, 592-597.
Oard, D. W., and Dorr, B. J. 1996. A survey of multilingual text retrieval. Technical report, University of Maryland.
Radwan, K., and Fluhr, C. 1995. Textual database lexicon used as a filter to resolve semantic ambiguity, application on multilingual information retrieval. In Fourth Annual Symposium on Document Analysis and Information Retrieval.
Resnik, P. 1997. Evaluating multilingual gisting of web pages. In Working Notes of the AAAI97 workshop on Cross-Language Text and Speech Retrieval.
Schuetze, H., and Pedersen, J. O. 1995. Information retrieval based on word senses. In Fourth Annual Symposium on Document Analysis and Information Retrieval.