The Temple WebTranslator R~mi Zajac and Mark Casper ComputingResearch Laboratory NewMexicoState University, Las Cruces, NM88003 { zajac,mcasper}@crl.nmsu.edu ElectronicCopy:http://crl.nmsu.edu/temple/twfftwt.aaai97.html From: AAAI Technical Report SS-97-02. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved. Abstract New Websites in foreignlanguages are appearingeveryday, andlanguagebarriers threatento atomizethe WorldWide Web into closedlinguisticcommunities. TheTemple project has developedan open multilingual architecture and software support for rapid developmentof machine translationsystems for assimilationpurposes.Thetargeted languagesare thosefor whichnatural language processing andhuman resourcesare scarceor difficult to obtain. The goalis to supportrapiddevelopment of machine translation functionalities in a veryshorttimewithlimitedresources. A Webfront-endallowsa user to submitan HTML pageto the machine translationserverandget backthe translatedpage in English. Introduction WebMachineTranslation (MT)is already a reality for several millions of JapaneseWebsurfers whoprefer to read an l~nglish HTML page in bad Japaneserather than in the Englishoriginal. Thesecrude MTsystems(see for example Church & Hovy 93. Kay 80) have several common characteristics: ¯ Theyare small enoughto run on low-endPCs, ¯ Theyare robust enoughto process any HTML input document, ¯ Theyare fast enoughso as to addonly a smalldelay in accessinga document, ¯ Despitethe lowquality of the translation they have wide acceptance, a typical feature of MTfor 1assimilationas opposedto disseminationpurposes, ¯ Theyare cheap. A multilingualWebtraaslator wouldlimit the advantage of small size, makingpersonal multilingual MTsystems impractical for more than a few languages. Whentrue multilinguality is concerned, nmnin~ aa MTsystem on a workstationis probablythe adequateanswer, although a smallsize for each component is still desirable. Thebest state-of-the-art MTsystemsproducemuchhigher quality 1. Assimilationis whata Websurfer or an analyst does by browsinga large numberof documents tor retrmvingsome intormat~on. Omsemination is, for example, the publicationof a manualor a commercial brochure. translations than those of cheapWebtranslators. After all, they are mostlytargeted for dissemination,whosequality requirementsare far morestringent than for assimilation. However, to use these traditional systemsfor assimilationis probablyan overkill and could evenbe counter-productive in the long run; these systemsare usually specializedfor somespecific texts anddomainsand amrather difficult to maintainand upgrade. 2 demonstratedthe feasibility TheTempleproject at CRL of rapid developmentof multilingual MTsystems for assimilationpurposesby using a selection of state-of-theart MTtechnologies, e.g., finite-state technologyand automatedacquisition of languagesresources. This article gives a brief overviewof the Templeproject, a description of the TempleWebfront-end, and discusses aspects of MT for the World-Wide Web, some of which are being researchedin the on-goingCorelli project at CRL. Background: the Temple Project The Templeproject has developed an open multilingual architecture and softwaresupportfor rapid development of extensible MachineTranslation functionalities. The targeted lanmaagesamthose for whichnatural language processingand humanresources are scarce or difficult to obtain. The goal was to support rapid developmentof machinetranslation functionalitiesin a very short timewith limited resources. Glossary-BasedMachine-Trauslation (GBMT) is used provide an English gloss of a foreign document.A GBMT systemuses a bilingual phrasal dictionary (glossary) producea phrase-by-phrasetranslation. Translation(based on phrasepattern-matching)is fast andaccurate regarding the content of the documentand browseddocumentscan be translated almost in real-time. A GBMT system for a languagepair is also extremelysimple, cheap ~ndfast to develop. Moreover,all language resources used by the systemare entirely underthe controlof the user. The Temple GBMT system has been integrated in the Templemultilingual analyst’s workstation(Zajac & Vanni 96). This workstation uses a Tipster documentserver developed at CRL(Grishmaa 95) to store and retrieve 2. http://crl.nmsu.edu/temple. 149 documents and associated information, such as linguistic structures producedby someNLPcomponent.The analyst w{xkstation offers to an English-speaking analyst a variety of tools to browsesets of documents in Arabic,Japanese, Spanishand Russian, including a Uuicode-based (Uuicode 96) multilingual editor, a simple machinetranslation functionality that is implementedusing the GBMT system andotherinformation retrieval tools (Callanet al. 92). Glossary-Based MT GBMT systemsput an emphasison a feature of natural languagethat is frequently overlookedby (computational) linguists, i.e., the fact that the meaning of a sentenceis very often non-compositional. Idioms, collocations and specialized terminology,whichoccur muchmoreoften than is usually assumedin real texts, must all be listed individuallyin the lexical databaseof the system.This is the very reasonwhyolder systems,whosedictionaries and other lexical resourceshavefrequentlybeendevelopedover several decades, still out-performnewersystemsusing better technology. GBMT, as exemplified by the Temple system, is a suitable paradigmfor MTfor assimilation. Usingfinitestate technologythroughout, a GBMT system can be both fast andcompact.Byusingvariouslevels of defaultsin case of failure at a higherlevel, translationis also very robust. Moreover, these types of systems are mucheasier to developthan moretraditional MTsystems; all the basic resourcesusedby the systemcan be presentedto a user in a very intuitive anddeclarativeway,greatly facilitating the maintenance andevolutionof the system. Developinglexical resources for Mr systems is the highest-cost component for any type of system. Automating the acquisition of such resources is therefore a very importanttopic, addressedin current MTresearchunderthe learningor inductionparadigms.Therestriction to finitestate technologyshould bring important advantagesfor leamin£since induction algorithmsfor regular languages givemuchbetter results thanfor other classes of languages. Glossary-BasedMachineTranslatien (GBMT) was first developedat CarnegieMellonUniversity as part of the Panglossproject (Cohenet al. 93, Nirenburget al. 93, Frederking et al. 93), and a sizeable Spanish-English GBMT system was implemented.The Templeproject has built upon this experience and extended the GBMT approach to other languages: Japanese, Arabic, and Russian.This experiencewith other languageshas provided significant insights for the developmentof a versatile GBMT engine and for the use of off-the-shelf components for building a completeMTsystem. Building a generic platform for integrating various MTsystemsin a single flexible user environment,built uponthe Tipster document architecture, has also been a valuable experience for developinggeneric natural languageprocessing support systems. The User Perspective TheTempleAnalyst’s Workstationis incorporated into a Tipster documentmanagement architecture and allows both lxanslator/analysts andmonolingual analysts to use the Mr function for assessing the relevance of a t~anslated documentor otherwise using its information in the performanceof other types of information processing. Translators can also use its output as a roughdraft from whichto begin the process of producinga translation, followingup with specific post-editing functions. Theuser (translator or analys0can: ¯ Browsea collection of documentsmanagedby a Tipster DocumentManagerusing a Collection/ Documentbrowser, ¯ Viewand edit foreign languagedocumentsor their Englishtranslations using the multilingualTipster Editor for I)oonnents, ¯ Translate foreign documentsusing the generic translationfunction, ¯ Browseand edit lexical resources (bilingual dictionaries andglossaries) usingthe multilingual TempleLexicalEditor. Although the translation providedby the systemis only a phrase-for-phrase(or word-for-word) gloss of the original, the systemis entirely underthe control oftheuser whocan modifyanyessential part of it, i.e., the dictionaries and glossaries. Froma user’s point of view, the system is predictable, responsive, affordable and easy to use and maintain. Like advancedMr systems, it uses reliable morphologicalprocessors and taggers, componentswhich are relatively inexpensive,requirelittle or no maintenance, and greatly enhanceoutput quality of GBMT. The MTDeveloper Perspective A Multilingual Architecture. The Templearchitecture uses a multilingualtext library developedat CRLto suppc~ multilingualtext processing.Thislibrary, availablefor Unix systems,is capableof handlinga large number of character codesets and provides muitilingual string processing functionalities and character codeconversionfor a large variety of codesets. AmultilingualMotiftext widgetthat can be embedded in higher-level applications (as in the TempleLexical Editor) and a simple muitilingual text editor are also supported.This library provedto be a major asset for the project since fewcomparablefunctionalities for the range of languagesprocessedin the Templeproject are availablein the Unixenvironment. 150 Althoughfull Unlcodesupportwasa goal of the project, this could not be achievedentirely. Currently. only the Arabic morphological analyzer has built-in Unicode support(as well as other codesets);however, a full Unicode library shouldbe availablefor use in the Corelli project(see below). Reuseof Machine-Readable Dictionaries. Bilingualdictionaries are processedversions of various machine-readabledictionaries (MRDs),for example,the Collins Spanish-English dictionary (or any other MT dictionaries that have been restructured to conformto Temple’slexical format, see Stein et al. 93). Since morphologicalanalyzers and dictionaries maycomefrom different sources, they mayhave incompatible lexical representation, as it happenedfor the Japanese-English dictionary and the Japanese morphologicalanalyzer. In such cases, integration is achieved by mapping the dictionary, including, for example, part-of-speech information,to a standardizedformat, and by developinga filter that mapsthe morphological analyzerresults to that structure. Reuse of Natural LanguageProcessing Components. Animportantdecision in the Templeproject wasto use available NLPcomponents and resources whenever possible (Farwell 94, Penman88, Matsumoto 93). This led to the definition of an openarchitecture that provides support for integrating external tools. SomeTemple componentshave been developedas part of the project (e.g., the ArabicandRussianmorphological analyzers) but havebeen integrated using the samemethodology as other tools. TheTemplearchitecture is built arounda nni~olai internal canonical linguistic representation for all languages;all components read or maptheir resnlts to this representation, whichuses Tipster annotations. All NLP tools are encapsulatedusing Tcl wrappersfor mappingthe tool representationto the Temple representation. Morphological informationis transferred fromthe source to the target languageusingmorphological transfer tables that mapcategories andfeatures froma sourcelexical item to the equivalent English lexical item. The GBMT engine itself is fully generic andis parametrizedby a bilingual glossaryanda morphological transfer table. Semi-automatic Development of Glossaries, Small glossaries (between 2,000 and 20,000 entries dependingonthe language) have beendevelopedfor each lan~age.Theacquisition processis as follows: 1. An Ngramextraction programis used to collect recurrentwordpatterns in a givencorpus(Daviset al. 95); 2. This set of patterns is loaded in the Lexical Databaseas partial glossaryentries; 3. Thetranslation of each entry is addedmanually. The development cost of such glossaries remains relatively low since the structure and the information encodedin a glossaryentry is very simple. MTfor the Web Althoughthe Templesystem wasnot originally designed for Webbrowsing,it wasneverthelessdesignedfor analysts browsinglarge numbersof relatively small, a fewpagesat most, heterogeneousdocuments,in different domains,a variety of styles, and different languages. The characteristicsof the texts are quite close to mostof whatis foundin Web sites, andthe analyst task is also close to the typical behaviorof a Websurfer. Totest the adequacyof this hypothesis, we developeda Webfront-end for the Temple MTsystem which includes an ~ parser. This experimentis describedhereafterin the followingsection. TheTemplesystemis a laboratory prototype whichwas used to demonstratethe feasibility of the approachon a realistic scale. In a secondphase, whichstarted during snmmer1996, the Corelli project is re-implementingsome key componentsof the system and developingthe Temple technologyin the followingdirections: ¯ Developinga newNLParchitecture to serve as the integration layer of a multi-engineMrsystem(in cooperationwith the Mikrokosmos projec0. ¯ Developingand extending the pattern-matching technologythat is at the core of the GBMT system. ¯ Developing a comprehensive set of lexical acquisition tools for the automateddevelopment of glossariesandother lexical resources. ¯ Developinga Translation Editor integrated in a DocumentPreparation System(FrameMaker). ¯ Applyingthis technologyto newlanguagepairs. The Temple Web Translator The Websurfer is not interested in MT,only in the document’scontent, andany impediment to straightforward access is considereddetrimental. Therefore,the MTsystem shouldbe as unobtrusiveas possible. In the ideal scenario, both browsersand servers wouldsupportrelevant features of the HTI’Pprotocol, i.e., languagenegotiation(only a few currently provide this functionality), and the charset and languageattributes. Thesefeatures are used in building applications that automatically select the appropriate languageandcharacterset for display. These features however do not help to process a document mixing several languages (e.g., for many JapaneseWebpages) and character sets, somethingwhich is unavoidable when, for example, an unknownforeign wordneeds to be inserted in the Englishtranslation. The 151 current HTML standard does not provide support for multilingual documents, although the World-WideWeb Consortium OV3C) has issued a draft on the internationalization of HTML (the so-called ’ilSn’ draft, see Nicol et al. 96) andproposesthat a future version of HTML support Unicodeand a ’lang’ attribute (see also CarrascoBenitez 96 for an overviewof issues related to il8n and multilingualism). In the ideal scenario,onthe user side, a user wouldset an ordered llst of preferred languagesfor browsingand get back the documentin one of the languagesrequested. On the server side, the server wouldautomaticallyselect the appropriateversionif availableat the site. If the document is not available in any of the languagesspecified by the user. there are twopossibilities for translating a document in a requested language since the MTsystem could be locatedat the server or at the browser.If the MTsystemis located at the server site, the server could send the MT systema documentin one of the languagesacceptedby the system and get a translation in one of the languages requestedby the user. If the MTsystemis located at the user site. languagenegotiation can be usedby the browser for requesti,~ a translation beforedisplayingthe document. There are also classical problems in relating and displayinga document and its translation. Thesourceand its translation couldbe displayedside-by-sidein different browsers, side-by-side in the samebrowser, or using an interlinear modelwithsentencelevel translations occurring adjacent to corresponding source sentences. Several proposals have already been madeto extend HTPPand 3HTML to handlethis set of problems(e.g., Bryan96). Whilewaitingfor full supportof the i18u draft, wehave implementedan interface in which a user can send a translation request to the TempleTranslationServer: the user provides the URLand the Translation Server sends backthe Fnslish translation of the foreign Webpagein a newbrowser. TheTempleWebTranslation architecture is pictured in Figure 1. A Webbrowserdisplayingthe Javascript-enabled WebTranslatorTool(Figure 2) allowsthe user to enter the URLof a documentto be translated. Thetool generates a translation request for the indicated document,whichis delivered to CRL’sTempleTranslation Server via CRL’s Webserver and a CGIscript. The TempleTranslation Server retrieves the documentfrom the Web.parses the documentstructure, translates its textual contents, and returns the translated document, with original HTML formatting intact, to the user. The newlytranslated documentis then displayedin a separate browserwindow. CTranslatorTool),~ URL runs ated Document ~CRLWEBServ!r~ ~L- Translated~ocument | (WebServer ent Translated !eument ~ouree D;L( CGI Script TempleTranslationServer-~ / I Codeset ITranslationlLanguage/ ~onverter [ Engines [Recogmze~// Figure 1: CRL WebTranslationSystemArchitecture. The current implementation of the WebTranslation Systemalso supports delivery of a translated documentto the user via e-mail.In the future, wewill also providean emall interface to the Translation Systemto allow for asynchronous access: as currently envisiened,the user will be able to submit either a URLor a mime-encoded document for translation. For translating Webpages from any site in the world, several problemsneedto be addressed.After retrieving the selected document fromthe Web,the TranslationServer is faced with the tasks of languageandcodesetdetermination, possible codeset conversion,and HTML parsing. First, the server mustdeterminethe document’s language and codeset characteristics. Our first WebTranslation System implementation assumes that the HTTPserver delivering the documentcorrectly returns the document’s codesetin the H’ITPheader. If the codesetof the document can not be determinedor does not appear to matchone of the supportedtranslation languages,a correspondingerror is returned to the user. If the user is confident that the documentis indeedin one of the supportedlanguages,he/ she mayspecify the correct languageand codeset via the TranslatorTooland resubmitthe request. 3. All referenceson internationalizationof the Webcan be obtainedat the W3C website, http://www.w3.org. 152 ~~iTramlator ,4. .:! ........ l I HTML version 3.2 is used to extract the textual content of the documentfor translation. Theonly formatting that might be lost or misaligned would be non-strnctural typefacingtags due to non word-for-word translation. For instance, the original documentmighthavea single word highlightedin a groupof wordstranslated as a phrasalunit. Sinceit is often difficult if not impossibleto associatea specific sourcewordwith its equivalenttarget word,this type of formattingwouldbe lost. Conclusion The emergenceof efficient and low-cost MTtechnology combinedwith advancesin MTarchitecture (interlinguas, reversibility) make low-cost multilingual MTfor assimilationpurposesattractive, andthe rapid emergence of multinational Websites and accompanyingtranslation needsseemsto be an excellent opportunityto deploysuch technology.This is not to say that all problemshavebeen solved, but there are already MTsystems using mature elementsof this technology. Since all elementsof this technologyhaveprovenfeasible, webelieveit nowpossible to designa MTarchitecture for assimilation purposesthat can support rapid development of multilingual MTsystems. TheCorelli project4 is aimedexactlyat this. Figure2: WebTranslator ToolGUI. Oncethe codeset of the documentis known,it will be possible to determinethe associatedlanguagefor many,if not all, languages.For somedocuments,the languagemode will be included in the HTrPheader by the Webserver. Whenthis is not the case. manydocuments will be encoded usingcodesetsthat are uniquelyassociatedwith a language. The remaining (difficulO documentsthen will be those encodedusing codesets such as ISO-8859-1 (ISO-latinl), whichcorrespondto multiple languages. Thetranslation server will attemptto disambiguate these cases by looking first for a languageMEFA tag within the documentitself. Whereno suchtag exists, the server will attemptto utilize CRL’s statistical languageguesserto select the mostlikely languagegiventhe codeset. Thesecondproblemfacing the TranslatienServer is the needto present the document to the TranslationEnginein oneof the supportedcodesets.If the document’s codesetis not one of those supportedby the particular language’s translation engine, then the document will be convertedby CRL’scodeset conversion tool as appropriate. CRL’s conversion tool currently supports conversion between approximately 90 of the most frequently used codesets. Finally, the TranslationServermustpreserve,as closely as possible, the original HTML formattingof the document duringthe translation process. AnHTML parser supporting Acknowledgments.Theauthors wish to thank Sergei Nirenburg,Jim Cowie,Bill Ogdenand MichelleVannifor their help and support, and for their contribution to the Temple project. Wewould like to acknowledge the collaboration of AhmedMalld, Vanishree Mahesh,Nick Ourusoff, Heather Pfeiffer, SusumuDukeYasuda, Yin Wanying,Daniel Wood,and also manyother students who havecontributedto this project. Finally, wewouldalso like to thanks anonymousreviewers for their helpful and constructive comments.The Templeproject has been funded by the USDoDunder grant 94-R-3075/A0001. References Callan. J.P.; Croft, W.B.; and Harding, S.M. 1992. The INQUERY Retrieval System. In Proceedings of the 3rd International Conferenceon Databaseand Expert Systems Applications,78-83.Springer-Verlag. Carrasco Benitez, M.T. 1996. "Winter: Web Internationalization & Multilinguism". INTERNEFDRAFT <draft-benitez-winter-cultures-00.txt>, May16th, 1996. Church, K. and Hovy, E. 1993. GoodApplications for Crnmmy MachineTranslation. MachineTranslation 8(4): 239-258. 4. http://crl.nmsu.edu/eorelli. 153 Bryan, M. 1996. LinkingHTML Translations. Version2.0, The SGML Centre. Davis, M.; Dunning, T.; and Ogden, B. 1995. String Matching Strategies and N-Gram Comparisons. In Proceedingsof the 7th Conferenceof the EuropeanChapter of the Associationfor ComputationalLinguistics, 27-31. Dublin,Ireland: UniversityCollegeDublin,Beltield. Frederking,R.; Grannes,D.; Cousseauo P.; and Nirenburg, S. 1993. An MATTool and Its Effectiveness. In Proceedings of the DARPA HumanLanguageTechnology Workshop. Princeton, NJ. Farwell,D.; I-Ielmreich,S.; Jin, W.;Casper,M.; Hargrave, J.; Molina-Salgado, H.; and Weng,F. 1994. Panglyzer: SpanishLanguageAnalysis System.In Proceedingsof the Conferenceof the Associationof MachineTranslation in the Americas,56-64. Columbia,MD. Grishman,R. ed. 1995. Tipster Phase 1T Architecture Design Document,Version 1.52. New-YorkUniversity. (http://cs.nyu.edu/tipster) Kay, M. 1980. The Proper Place of Menand Machinesin MachineTranslation. XeroxPARC Technical Report, CSL80-11. 1988. The PenmanPrimer, User Guide, and Reference Manual.UnpublishedUSC/ISIdocumentation. Matsumoto,Y.; Sadao, K.; Takeji, U.; Yutaka, M.; and Makoto, N. 1993. Japanese Morphological Analysis System, JUMAN. Kyoto University, Nara Science and Technology GraduateSchoolUniversity(in Japanese). Nicol, G.; Yergeau,F.; Adams,G.; and Duerst, M. 1996. Internationalizatien of the HypertextMarkupLanguage. Internet Draft <draft-ietf-html-i18n-05.txt>, Network WorkingGroup. Nirenburg,S.; Shell, P.; Cohen,A.; Cousseau,P.; Grammes, D.; and McNeilly, C. 1993. Multi-purtxx~ Development a.d OperationsEnvironments for Natural Language Applications. In Proceedingsof the 3rd Conferenceon Applied Natural L~n~ageProcessing, page nos. Trento, Italy. Stein, G.C.; Lin, E; Bruce, R.; Weng,E; and Guthrie, L. 1993. The Developmentof an Application Independent Lexicon: LexBase.CRLTechnical Report, MCCS-92-247. The Unicode Consortium. 1996. The UnicodeStandard, Version2.0. Reading,Mass:Addison-Wesley. Zajac, R. and Vanni, M. 1996. Glossary-BasedMTEngines in a MnitilingualTranslat~’s Workstationfor Information Processing. MachineTranslation, Special Issue on New Tools for Human Translators. Forthcoming. 154