The Temple Web Translator

advertisement
The Temple WebTranslator
R~mi Zajac and Mark Casper
ComputingResearch Laboratory
NewMexicoState University, Las Cruces, NM88003
{ zajac,mcasper}@crl.nmsu.edu
ElectronicCopy:http://crl.nmsu.edu/temple/twfftwt.aaai97.html
From: AAAI Technical Report SS-97-02. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Abstract
New
Websites in foreignlanguages
are appearingeveryday,
andlanguagebarriers threatento atomizethe WorldWide
Web
into closedlinguisticcommunities.
TheTemple
project
has developedan open multilingual architecture and
software support for rapid developmentof machine
translationsystems
for assimilationpurposes.Thetargeted
languagesare thosefor whichnatural language
processing
andhuman
resourcesare scarceor difficult to obtain. The
goalis to supportrapiddevelopment
of machine
translation
functionalities
in a veryshorttimewithlimitedresources.
A
Webfront-endallowsa user to submitan HTML
pageto the
machine
translationserverandget backthe translatedpage
in English.
Introduction
WebMachineTranslation (MT)is already a reality for
several millions of JapaneseWebsurfers whoprefer to read
an l~nglish HTML
page in bad Japaneserather than in the
Englishoriginal. Thesecrude MTsystems(see for example
Church & Hovy 93. Kay 80) have several common
characteristics:
¯ Theyare small enoughto run on low-endPCs,
¯ Theyare robust enoughto process any HTML
input
document,
¯ Theyare fast enoughso as to addonly a smalldelay
in accessinga document,
¯ Despitethe lowquality of the translation they have
wide acceptance, a typical feature of MTfor
1assimilationas opposedto disseminationpurposes,
¯ Theyare cheap.
A multilingualWebtraaslator wouldlimit the advantage
of small size, makingpersonal multilingual MTsystems
impractical for more than a few languages. Whentrue
multilinguality is concerned, nmnin~ aa MTsystem on a
workstationis probablythe adequateanswer, although a
smallsize for each component
is still desirable. Thebest
state-of-the-art MTsystemsproducemuchhigher quality
1. Assimilationis whata Websurfer or an analyst does by
browsinga large numberof documents
tor retrmvingsome
intormat~on.
Omsemination
is, for example,
the publicationof
a manualor a commercial
brochure.
translations than those of cheapWebtranslators. After all,
they are mostlytargeted for dissemination,whosequality
requirementsare far morestringent than for assimilation.
However,
to use these traditional systemsfor assimilationis
probablyan overkill and could evenbe counter-productive
in the long run; these systemsare usually specializedfor
somespecific texts anddomainsand amrather difficult to
maintainand upgrade.
2 demonstratedthe feasibility
TheTempleproject at CRL
of rapid developmentof multilingual MTsystems for
assimilationpurposesby using a selection of state-of-theart MTtechnologies, e.g., finite-state technologyand
automatedacquisition of languagesresources. This article
gives a brief overviewof the Templeproject, a description
of the TempleWebfront-end, and discusses aspects of MT
for the World-Wide Web, some of which are being
researchedin the on-goingCorelli project at CRL.
Background: the Temple Project
The Templeproject has developed an open multilingual
architecture and softwaresupportfor rapid development
of
extensible MachineTranslation functionalities. The
targeted lanmaagesamthose for whichnatural language
processingand humanresources are scarce or difficult to
obtain. The goal was to support rapid developmentof
machinetranslation functionalitiesin a very short timewith
limited resources.
Glossary-BasedMachine-Trauslation (GBMT)
is used
provide an English gloss of a foreign document.A GBMT
systemuses a bilingual phrasal dictionary (glossary)
producea phrase-by-phrasetranslation. Translation(based
on phrasepattern-matching)is fast andaccurate regarding
the content of the documentand browseddocumentscan be
translated almost in real-time. A GBMT
system for a
languagepair is also extremelysimple, cheap ~ndfast to
develop. Moreover,all language resources used by the
systemare entirely underthe controlof the user.
The Temple GBMT
system has been integrated in the
Templemultilingual analyst’s workstation(Zajac & Vanni
96). This workstation uses a Tipster documentserver
developed at CRL(Grishmaa 95) to store and retrieve
2. http://crl.nmsu.edu/temple.
149
documents
and associated information, such as linguistic
structures producedby someNLPcomponent.The analyst
w{xkstation
offers to an English-speaking
analyst a variety
of tools to browsesets of documents
in Arabic,Japanese,
Spanishand Russian, including a Uuicode-based
(Uuicode
96) multilingual editor, a simple machinetranslation
functionality that is implementedusing the GBMT
system
andotherinformation
retrieval tools (Callanet al. 92).
Glossary-Based MT
GBMT
systemsput an emphasison a feature of natural
languagethat is frequently overlookedby (computational)
linguists, i.e., the fact that the meaning
of a sentenceis very
often non-compositional. Idioms, collocations and
specialized terminology,whichoccur muchmoreoften than
is usually assumedin real texts, must all be listed
individuallyin the lexical databaseof the system.This is
the very reasonwhyolder systems,whosedictionaries and
other lexical resourceshavefrequentlybeendevelopedover
several decades, still out-performnewersystemsusing
better technology.
GBMT,
as exemplified by the Temple system, is a
suitable paradigmfor MTfor assimilation. Usingfinitestate technologythroughout, a GBMT
system can be both
fast andcompact.Byusingvariouslevels of defaultsin case
of failure at a higherlevel, translationis also very robust.
Moreover, these types of systems are mucheasier to
developthan moretraditional MTsystems; all the basic
resourcesusedby the systemcan be presentedto a user in a
very intuitive anddeclarativeway,greatly facilitating the
maintenance
andevolutionof the system.
Developinglexical resources for Mr systems is the
highest-cost component
for any type of system. Automating
the acquisition of such resources is therefore a very
importanttopic, addressedin current MTresearchunderthe
learningor inductionparadigms.Therestriction to finitestate technologyshould bring important advantagesfor
leamin£since induction algorithmsfor regular languages
givemuchbetter results thanfor other classes of languages.
Glossary-BasedMachineTranslatien (GBMT)
was first
developedat CarnegieMellonUniversity as part of the
Panglossproject (Cohenet al. 93, Nirenburget al. 93,
Frederking et al. 93), and a sizeable Spanish-English
GBMT
system was implemented.The Templeproject has
built upon this experience and extended the GBMT
approach to other languages: Japanese, Arabic, and
Russian.This experiencewith other languageshas provided
significant insights for the developmentof a versatile
GBMT
engine and for the use of off-the-shelf components
for building a completeMTsystem. Building a generic
platform for integrating various MTsystemsin a single
flexible user environment,built uponthe Tipster document
architecture, has also been a valuable experience for
developinggeneric natural languageprocessing support
systems.
The User Perspective
TheTempleAnalyst’s Workstationis incorporated into a
Tipster documentmanagement
architecture and allows both
lxanslator/analysts andmonolingual
analysts to use the Mr
function for assessing the relevance of a t~anslated
documentor otherwise using its information in the
performanceof other types of information processing.
Translators can also use its output as a roughdraft from
whichto begin the process of producinga translation,
followingup with specific post-editing functions. Theuser
(translator or analys0can:
¯ Browsea collection of documentsmanagedby a
Tipster DocumentManagerusing a Collection/
Documentbrowser,
¯ Viewand edit foreign languagedocumentsor their
Englishtranslations using the multilingualTipster
Editor for I)oonnents,
¯ Translate foreign documentsusing the generic
translationfunction,
¯ Browseand edit lexical resources (bilingual
dictionaries andglossaries) usingthe multilingual
TempleLexicalEditor.
Although
the translation providedby the systemis only a
phrase-for-phrase(or word-for-word)
gloss of the original,
the systemis entirely underthe control
oftheuser whocan
modifyanyessential part of it, i.e., the dictionaries and
glossaries. Froma user’s point of view, the system is
predictable, responsive, affordable and easy to use and
maintain. Like advancedMr systems, it uses reliable
morphologicalprocessors and taggers, componentswhich
are relatively inexpensive,requirelittle or no maintenance,
and greatly enhanceoutput quality of GBMT.
The MTDeveloper Perspective
A Multilingual Architecture. The Templearchitecture
uses a multilingualtext library developedat CRLto suppc~
multilingualtext processing.Thislibrary,
availablefor Unix
systems,is capableof handlinga large number
of character
codesets and provides muitilingual string
processing
functionalities and character codeconversionfor a large
variety of codesets. AmultilingualMotiftext widgetthat
can be embedded
in higher-level applications (as in the
TempleLexical Editor) and a simple muitilingual text
editor are also supported.This library provedto be a major
asset for the project since fewcomparablefunctionalities
for the range of languagesprocessedin the Templeproject
are availablein the Unixenvironment.
150
Althoughfull Unlcodesupportwasa goal of the project,
this could not be achievedentirely. Currently. only the
Arabic morphological analyzer has built-in Unicode
support(as well as other codesets);however,
a full Unicode
library shouldbe availablefor use in the Corelli project(see
below).
Reuseof Machine-Readable
Dictionaries.
Bilingualdictionaries are processedversions of various
machine-readabledictionaries (MRDs),for example,the
Collins Spanish-English dictionary (or any other MT
dictionaries that have been restructured to conformto
Temple’slexical format, see Stein et al. 93). Since
morphologicalanalyzers and dictionaries maycomefrom
different sources, they mayhave incompatible lexical
representation, as it happenedfor the Japanese-English
dictionary and the Japanese morphologicalanalyzer. In
such cases, integration is achieved by mapping
the
dictionary, including, for example, part-of-speech
information,to a standardizedformat, and by developinga
filter that mapsthe morphological
analyzerresults to that
structure.
Reuse of Natural LanguageProcessing Components.
Animportantdecision in the Templeproject wasto use
available NLPcomponents and resources whenever
possible (Farwell 94, Penman88, Matsumoto
93). This led
to the definition of an openarchitecture that provides
support for integrating external tools. SomeTemple
componentshave been developedas part of the project
(e.g., the ArabicandRussianmorphological
analyzers) but
havebeen integrated using the samemethodology
as other
tools. TheTemplearchitecture is built arounda nni~olai
internal canonical linguistic representation for all
languages;all components
read or maptheir resnlts to this
representation, whichuses Tipster annotations. All NLP
tools are encapsulatedusing Tcl wrappersfor mappingthe
tool representationto the Temple
representation.
Morphological
informationis transferred fromthe source
to the target languageusingmorphological
transfer tables
that mapcategories andfeatures froma sourcelexical item
to the equivalent English lexical item. The GBMT
engine
itself is fully generic andis parametrizedby a bilingual
glossaryanda morphological
transfer table.
Semi-automatic
Development
of Glossaries,
Small glossaries (between 2,000 and 20,000 entries
dependingonthe language)
have beendevelopedfor each
lan~age.Theacquisition processis as follows:
1. An Ngramextraction programis used to collect
recurrentwordpatterns in a givencorpus(Daviset
al.
95);
2. This set of patterns is loaded in the Lexical
Databaseas partial glossaryentries;
3. Thetranslation of each entry is addedmanually.
The development cost of such glossaries remains
relatively low since the structure and the information
encodedin a glossaryentry is very simple.
MTfor the Web
Althoughthe Templesystem wasnot originally designed
for Webbrowsing,it wasneverthelessdesignedfor analysts
browsinglarge numbersof relatively small, a fewpagesat
most, heterogeneousdocuments,in different domains,a
variety of styles, and different languages. The
characteristicsof the texts are quite close to mostof whatis
foundin Web
sites, andthe analyst task is also close to the
typical behaviorof a Websurfer. Totest the adequacyof
this hypothesis, we developeda Webfront-end for the
Temple MTsystem which includes an ~ parser. This
experimentis describedhereafterin the followingsection.
TheTemplesystemis a laboratory prototype whichwas
used to demonstratethe feasibility of the approachon a
realistic scale. In a secondphase, whichstarted during
snmmer1996, the Corelli project is re-implementingsome
key componentsof the system and developingthe Temple
technologyin the followingdirections:
¯ Developinga newNLParchitecture to serve as the
integration layer of a multi-engineMrsystem(in
cooperationwith the Mikrokosmos
projec0.
¯ Developingand extending the pattern-matching
technologythat is at the core of the GBMT
system.
¯ Developing a comprehensive set of lexical
acquisition tools for the automateddevelopment
of
glossariesandother lexical resources.
¯ Developinga Translation Editor integrated in a
DocumentPreparation System(FrameMaker).
¯ Applyingthis technologyto newlanguagepairs.
The Temple Web Translator
The Websurfer is not interested in MT,only in the
document’scontent, andany impediment
to straightforward
access is considereddetrimental. Therefore,the MTsystem
shouldbe as unobtrusiveas possible. In the ideal scenario,
both browsersand servers wouldsupportrelevant features
of the HTI’Pprotocol, i.e., languagenegotiation(only a few
currently provide this functionality), and the charset and
languageattributes. Thesefeatures are used in building
applications that automatically select the appropriate
languageandcharacterset for display.
These features however do not help to process a
document mixing several languages (e.g., for many
JapaneseWebpages) and character sets, somethingwhich
is unavoidable when, for example, an unknownforeign
wordneeds to be inserted in the Englishtranslation. The
151
current HTML
standard does not provide support for
multilingual documents, although the World-WideWeb
Consortium OV3C) has issued a draft on the
internationalization of HTML
(the so-called ’ilSn’ draft,
see Nicol et al. 96) andproposesthat a future version of
HTML
support Unicodeand a ’lang’ attribute (see also
CarrascoBenitez 96 for an overviewof issues related to
il8n and multilingualism).
In the ideal scenario,onthe user side, a user wouldset an
ordered llst of preferred languagesfor browsingand get
back the documentin one of the languagesrequested. On
the server side, the server wouldautomaticallyselect the
appropriateversionif availableat the site. If the document
is not available in any of the languagesspecified by the
user. there are twopossibilities for translating a document
in a requested language since the MTsystem could be
locatedat the server or at the browser.If the MTsystemis
located at the server site, the server could send the MT
systema documentin one of the languagesacceptedby the
system and get a translation in one of the languages
requestedby the user. If the MTsystemis located at the
user site. languagenegotiation can be usedby the browser
for requesti,~ a translation beforedisplayingthe document.
There are also classical problems in relating and
displayinga document
and its translation. Thesourceand
its translation couldbe displayedside-by-sidein different
browsers, side-by-side in the samebrowser, or using an
interlinear modelwithsentencelevel translations occurring
adjacent to corresponding source sentences. Several
proposals have already been madeto extend HTPPand
3HTML
to handlethis set of problems(e.g., Bryan96).
Whilewaitingfor full supportof the i18u draft, wehave
implementedan interface in which a user can send a
translation request to the TempleTranslationServer: the
user provides the URLand the Translation Server sends
backthe Fnslish translation of the foreign Webpagein a
newbrowser.
TheTempleWebTranslation architecture is pictured in
Figure 1. A Webbrowserdisplayingthe Javascript-enabled
WebTranslatorTool(Figure 2) allowsthe user to enter the
URLof a documentto be translated. Thetool generates a
translation request for the indicated document,whichis
delivered to CRL’sTempleTranslation Server via CRL’s
Webserver and a CGIscript. The TempleTranslation
Server retrieves the documentfrom the Web.parses the
documentstructure, translates its textual contents, and
returns the translated document, with original HTML
formatting intact, to the user. The newlytranslated
documentis then displayedin a separate browserwindow.
CTranslatorTool),~
URL runs ated Document
~CRLWEBServ!r~
~L- Translated~ocument
|
(WebServer
ent
Translated
!eument
~ouree D;L( CGI Script
TempleTranslationServer-~
/
I Codeset ITranslationlLanguage/
~onverter [ Engines [Recogmze~//
Figure 1: CRL
WebTranslationSystemArchitecture.
The current implementation of the WebTranslation
Systemalso supports delivery of a translated documentto
the user via e-mail.In the future, wewill also providean emall interface to the Translation Systemto allow for
asynchronous
access: as currently envisiened,the user will
be able to submit either a URLor a mime-encoded
document
for translation.
For translating Webpages from any site in the world,
several problemsneedto be addressed.After retrieving the
selected document
fromthe Web,the TranslationServer is
faced with the tasks of languageandcodesetdetermination,
possible codeset conversion,and HTML
parsing.
First, the server mustdeterminethe document’s
language
and codeset characteristics. Our first WebTranslation
System implementation assumes that the HTTPserver
delivering the documentcorrectly returns the document’s
codesetin the H’ITPheader. If the codesetof the document
can not be determinedor does not appear to matchone of
the supportedtranslation languages,a correspondingerror
is returned to the user. If the user is confident that the
documentis indeedin one of the supportedlanguages,he/
she mayspecify the correct languageand codeset via the
TranslatorTooland resubmitthe request.
3. All referenceson internationalizationof the Webcan be
obtainedat the W3C
website, http://www.w3.org.
152
~~iTramlator
,4.
.:!
........
l
I
HTML
version 3.2 is used to extract the textual content of
the documentfor translation. Theonly formatting that
might be lost or misaligned would be non-strnctural
typefacingtags due to non word-for-word
translation. For
instance, the original documentmighthavea single word
highlightedin a groupof wordstranslated as a phrasalunit.
Sinceit is often difficult if not impossibleto associatea
specific sourcewordwith its equivalenttarget word,this
type of formattingwouldbe lost.
Conclusion
The emergenceof efficient and low-cost MTtechnology
combinedwith advancesin MTarchitecture (interlinguas,
reversibility)
make low-cost multilingual MTfor
assimilationpurposesattractive, andthe rapid emergence
of
multinational Websites and accompanyingtranslation
needsseemsto be an excellent opportunityto deploysuch
technology.This is not to say that all problemshavebeen
solved, but there are already MTsystems using mature
elementsof this technology. Since all elementsof this
technologyhaveprovenfeasible, webelieveit nowpossible
to designa MTarchitecture for assimilation purposesthat
can support rapid development
of multilingual MTsystems.
TheCorelli project4 is aimedexactlyat this.
Figure2: WebTranslator
ToolGUI.
Oncethe codeset of the documentis known,it will be
possible to determinethe associatedlanguagefor many,if
not all, languages.For somedocuments,the languagemode
will be included in the HTrPheader by the Webserver.
Whenthis is not the case. manydocuments
will be encoded
usingcodesetsthat are uniquelyassociatedwith a language.
The remaining (difficulO documentsthen will be those
encodedusing codesets such as ISO-8859-1
(ISO-latinl),
whichcorrespondto multiple languages. Thetranslation
server will attemptto disambiguate
these cases by looking
first for a languageMEFA
tag within the documentitself.
Whereno suchtag exists, the server will attemptto utilize
CRL’s
statistical languageguesserto select the mostlikely
languagegiventhe codeset.
Thesecondproblemfacing the TranslatienServer is the
needto present the document
to the TranslationEnginein
oneof the supportedcodesets.If the document’s
codesetis
not one of those supportedby the particular language’s
translation engine, then the document
will be convertedby
CRL’scodeset conversion tool as appropriate. CRL’s
conversion tool currently supports conversion
between approximately 90 of the most frequently used
codesets.
Finally, the TranslationServermustpreserve,as closely
as possible, the original HTML
formattingof the document
duringthe translation process. AnHTML
parser supporting
Acknowledgments.Theauthors wish to thank Sergei
Nirenburg,Jim Cowie,Bill Ogdenand MichelleVannifor
their help and support, and for their contribution to the
Temple project. Wewould like to acknowledge the
collaboration of AhmedMalld, Vanishree Mahesh,Nick
Ourusoff, Heather Pfeiffer, SusumuDukeYasuda, Yin
Wanying,Daniel Wood,and also manyother students who
havecontributedto this project. Finally, wewouldalso like
to thanks anonymousreviewers for their helpful and
constructive comments.The Templeproject has been
funded by the USDoDunder grant 94-R-3075/A0001.
References
Callan. J.P.; Croft, W.B.; and Harding, S.M. 1992. The
INQUERY
Retrieval System. In Proceedings of the 3rd
International Conferenceon Databaseand Expert Systems
Applications,78-83.Springer-Verlag.
Carrasco Benitez,
M.T. 1996. "Winter: Web
Internationalization
& Multilinguism". INTERNEFDRAFT
<draft-benitez-winter-cultures-00.txt>, May16th,
1996.
Church, K. and Hovy, E. 1993. GoodApplications for
Crnmmy
MachineTranslation. MachineTranslation 8(4):
239-258.
4. http://crl.nmsu.edu/eorelli.
153
Bryan, M. 1996. LinkingHTML
Translations. Version2.0,
The SGML
Centre.
Davis, M.; Dunning, T.; and Ogden, B. 1995. String
Matching Strategies and N-Gram Comparisons. In
Proceedingsof the 7th Conferenceof the EuropeanChapter
of the Associationfor ComputationalLinguistics, 27-31.
Dublin,Ireland: UniversityCollegeDublin,Beltield.
Frederking,R.; Grannes,D.; Cousseauo
P.; and Nirenburg,
S. 1993. An MATTool and Its Effectiveness. In
Proceedings of the DARPA
HumanLanguageTechnology
Workshop.
Princeton, NJ.
Farwell,D.; I-Ielmreich,S.; Jin, W.;Casper,M.; Hargrave,
J.; Molina-Salgado,
H.; and Weng,F. 1994. Panglyzer:
SpanishLanguageAnalysis System.In Proceedingsof the
Conferenceof the Associationof MachineTranslation in
the Americas,56-64. Columbia,MD.
Grishman,R. ed. 1995. Tipster Phase 1T Architecture
Design Document,Version 1.52. New-YorkUniversity.
(http://cs.nyu.edu/tipster)
Kay, M. 1980. The Proper Place of Menand Machinesin
MachineTranslation. XeroxPARC
Technical Report, CSL80-11.
1988. The PenmanPrimer, User Guide, and Reference
Manual.UnpublishedUSC/ISIdocumentation.
Matsumoto,Y.; Sadao, K.; Takeji, U.; Yutaka, M.; and
Makoto, N. 1993. Japanese Morphological Analysis
System, JUMAN.
Kyoto University, Nara Science and
Technology
GraduateSchoolUniversity(in Japanese).
Nicol, G.; Yergeau,F.; Adams,G.; and Duerst, M. 1996.
Internationalizatien of the HypertextMarkupLanguage.
Internet Draft <draft-ietf-html-i18n-05.txt>, Network
WorkingGroup.
Nirenburg,S.; Shell, P.; Cohen,A.; Cousseau,P.; Grammes,
D.; and McNeilly, C. 1993. Multi-purtxx~ Development
a.d OperationsEnvironments for Natural Language
Applications. In Proceedingsof the 3rd Conferenceon
Applied Natural L~n~ageProcessing, page nos. Trento,
Italy.
Stein, G.C.; Lin, E; Bruce, R.; Weng,E; and Guthrie, L.
1993. The Developmentof an Application Independent
Lexicon: LexBase.CRLTechnical Report, MCCS-92-247.
The Unicode Consortium. 1996. The UnicodeStandard,
Version2.0. Reading,Mass:Addison-Wesley.
Zajac, R. and Vanni, M. 1996. Glossary-BasedMTEngines
in a MnitilingualTranslat~’s Workstationfor Information
Processing. MachineTranslation, Special Issue on New
Tools for Human
Translators. Forthcoming.
154
Download