Proposal to build a UNL-Japanese deconverter and a Japanese-UNL enconverter at GETA

Mutsuko Tomokiyo
mutsuko.tomokiyo@imag.fr
GETA-CLIPS-IMAG1) & ENST2)
1) 385 rue de la Bibliothèque, BP53, campus universitaire, 38041 Grenoble cedex 9, France
2) 46 rue Barrault, 75634 Paris cedex 13, France

Introduction

This document aims to formalize our motivations, first, for building a deconverter from UNL graphs to Japanese utterances as a final goal. Deconversion into Japanese has already been carried out by the UNL centre in Japan, using the DeCo programming language. In our view, however, this Japanese deconverter is not yet as operational as those for other languages, even though about 44.89% of the whole population of Japan are internet users1. It is also worth experimenting with several ways of building a deconverter for a language like Japanese, which differs widely from European languages in syntax, semantics and pragmatics. As far as Indo-European languages are concerned, the problems they pose to NLP are quite similar, and several methods and tools are already being tried by various UNL language centres, for French, Russian, Italian and Portuguese. These are the key reasons for proposing to build a UNL-Japanese deconverter.

At GETA, we use a machine translation (MT) environment called Ariane-G5, in which R&D on UNL-to-French deconversion has been carried out since 1997 [1] [2]. More recently, work on a UNL-to-Chinese deconverter has been started, and is currently conducted by a French senior researcher and a Chinese PhD student. Hence we will be able to apply the knowledge accumulated through this work to the processing of Japanese, although the linguistic phenomena differ between the three languages and the main difficulties will arise at different steps than for the other two languages.

Second, we propose a small experiment in automatic encoding of Japanese into UNL, based on an inductive learning method. For this, we will develop a corpus containing 500 pairs of Japanese utterances and UNL graphs, and test the feasibility of automatic encoding into UNL using an MT system based on the recursive-chain-link (RCL) learning method [4], which has been developed at the Araki laboratory at Hokkaido University and the Echizenya laboratory at Hokkai Gakuen University in Japan [3]. At present, the encoding (a synonym for "enconversion") of natural language utterances into UNL graphs is performed semi-automatically by human experts for all languages. Producing fully automatic enconversion methods is therefore an important issue for UNL users, and one that every member of the UNL society will eventually have to tackle.

We describe, in section 1, a generation method for Japanese and a method of automatic encoding (enconversion) into UNL. In section 2, we propose a domain to work on. The working steps and their timetable are given in section 3, followed by sections 4 and 5, where we list the persons concerned by this work and request a grant from the Spanish UNL centre.

1 According to a 2002 report from the Statistics Bureau of Japan, the proportion of internet users in the whole population is 55.14% in the USA, 42.31% in England, 31.38% in France, 35.24% in Italy and 4.09% in Russia [10].
1 Methodology

1.1 Generation of Japanese in the Ariane-G5 environment

Ariane-G5 is a generator of MT systems based on a transfer approach; several software and lingware components have been added to the original system in order to use it for the UNL-French deconversion. Deconversion from UNL graphs to a target language is performed in three successive steps in Ariane-G5, as shown in Fig. 1: analysis, transfer and generation.

[Fig. 1: The Ariane system2. The diagram shows the flow from source text to target text through three phases: analysis (AM: morphological analysis with ATEF; AS: structural analysis with ROBRA; AX, AY: expansive analysis with EXPANS), transfer (TL: lexical transfer; TS: structural transfer with ROBRA; TX, TY: expansive transfer with EXPANS; plus an interactive disambiguation module), and generation (GX, GY: expansive generation with EXPANS; GS: syntactic generation with ROBRA; GM: morphological generation with SYGMOR).]

2 The process flow is taken from [2]. The phases in boxes with a broken-line contour are optional.

UNL graphs already embody the results of morphological and structural analysis, so in this approach the UNL graphs themselves are the input to the transfer phase. The transfer in Ariane-G5, however, is a tree-to-tree conversion, so the UNL graphs are first converted into tree structures [11].

e.g.
(a) ポールはステファニーにバラを贈った。(Japanese text, Poru-ha sutefani-ni bara-wo oku-tta)

agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name))
(UNL graph)

The tree structure for the example sentence (a) is as follows: [tree diagram not reproduced].

The transfer phase consists in performing the lexical and structural transfer to the target language. The UW dictionary gives the lexical and structural information necessary for the transfer. The Japanese UW dictionary entries for the example are as follows:

present : CAT(CATV), GP2(A), VAL1(GN-wo), VAL2(GN-ni), AUXP(-tta) : 贈る (oku-ru)
rose : CAT(CATN), N(NC) : バラ (bara) [8]

In the example sentence there is no syntactic disparity between the valency of the Japanese verb 贈る and that of the UW "present", so a simple transfer is done without any problem.

In the generation phase, syntactic and morphological generation are also performed using the UW dictionary. The verb 贈る takes a "wo"-case constituent (VAL1 in the dictionary) and a "ni"-case constituent (VAL2 in the dictionary), so the particles を and に are added to the corresponding noun phrases during syntactic generation; the tense information of the sentence is then realized during morphological generation. 贈る is a verb of the standard conjugation type, but its form changes in the past tense, so the information AUXP(-tta) is given in the dictionary.

There is at present neither a Japanese UW dictionary nor Japanese linguistic information in Ariane-G5, so we will start the work by editing the Japanese dictionary. It is also currently impossible to display Japanese characters either in Ariane-G5 or in the UW dictionary, so we will use ASCII codes instead of inputting Japanese strings directly, and we will prepare a list of ASCII codes corresponding to Japanese characters.
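To give a concrete, if simplified, picture of the transfer-and-generation flow just described, the following short Python sketch regenerates example (a) from its UNL graph. It is purely illustrative: the actual deconverter is written with the Ariane-G5 specialized languages (ATEF, ROBRA, EXPANS, SYGMOR), and the toy dictionary below only mimics the UW entries shown above (the field names jpn, val1, val2 and auxp are our own, hypothetical encoding of the CAT/VAL/AUXP information).

# -*- coding: utf-8 -*-
# Illustrative sketch only: a toy regeneration of example (a) from its UNL graph.
import re

UNL = """agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name))"""

# Hypothetical rendering of the UW dictionary entries above: Japanese lemma,
# the case particles attached to the VAL1/VAL2 constituents, and the
# past-tense ending of the verb (AUXP).
UW_DICT = {
    "present":   {"jpn": "贈", "auxp": "った",
                  "val1": ("obj", "を"), "val2": ("goal", "に")},
    "rose":      {"jpn": "バラ"},
    "Paul":      {"jpn": "ポール"},
    "Stéphanie": {"jpn": "ステファニー"},
}

def headword(uw):
    # 'present(icl>do).@entry.@past' -> 'present'; 'rose.@pl' -> 'rose'
    return re.match(r"[^(.\s]+", uw.strip()).group(0)

def parse(unl):
    # Collect the entry predicate, its arguments indexed by UNL relation,
    # and whether the @past attribute is present.
    roles, pred, past = {}, None, False
    for rel, head, arg in re.findall(r"^(\w+)\(([^,]+),\s*(.+)\)$", unl, re.M):
        pred = headword(head)
        past = past or ".@past" in head
        roles[rel] = headword(arg)
    return pred, roles, past

def generate(pred, roles, past):
    # Linearize: agt + は, VAL2 constituent + に, VAL1 constituent + を, verb.
    verb = UW_DICT[pred]
    out = []
    if "agt" in roles:
        out.append(UW_DICT[roles["agt"]]["jpn"] + "は")
    for slot in ("val2", "val1"):
        role, particle = verb[slot]
        if role in roles:
            out.append(UW_DICT[roles[role]]["jpn"] + particle)
    out.append(verb["jpn"] + (verb["auxp"] if past else "る"))
    return "".join(out) + "。"

print(generate(*parse(UNL)))   # -> ポールはステファニーにバラを贈った。

The real generation will of course go through the intermediate tree structure and handle valency mismatches, word order variation and the ASCII transliteration constraint mentioned above; the sketch only shows where the VAL and AUXP information of the UW dictionary comes into play.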
1.2 Automatic encoding of Japanese text into UNL graphs

Almost all learning-based MT systems require large amounts of translation examples [5, 6, 7]. To overcome this problem, a recursive-chain-link (RCL) learning method has been proposed [3] and applied to the GA-ILMT system (Machine Translation using Inductive Learning with Genetic Algorithms) at the Araki laboratory [12]. GA-ILMT is a machine translation approach based on inductive learning with genetic algorithms, which imitate the evolutionary process of repeated generational replacement in order to adapt to the environment [13]. When English-Japanese MT is performed with this system, the user first inputs a source sentence in English; second, in the translation process, the system produces several candidate translations; third, the user proofreads the translated sentence; fourth, in the feedback process, the system determines fitness values and performs the selection of translation rules. In the learning process, new translation examples are automatically produced by crossover and mutation operations, and various translation rules are extracted from the translation examples by inductive learning [13]. The process flow is shown in Fig. 2.

[Fig. 2: Process flow [4]: source sentence → translation process → translation result → proofreading → correct translation result, with the feedback process and the learning process maintaining the dictionary of translation rules.]

The advantage of the RCL learning method is that new translation rules can be acquired from sparse data by using other, already acquired, translation rules. It is in fact very difficult to obtain a large number of Japanese-utterance/UNL-graph pairs; we think, however, that with this method about 500 pairs of Japanese utterances and their UNL graphs are sufficient for experiments in encoding Japanese into UNL graphs.

The mechanism of recursive chain-link-type learning (RCL) is the following. RCL-type learning in GA-ILMT produces "part translation rules" and "sentence translation rules" by turns. The process for the acquisition of "part translation rules" is as follows:
1. The system selects translation examples that have parts in common with the "sentence translation rules".
2. The system extracts, from the source language (SL, here Japanese utterances) and the target language (TL, here UNL graphs) of these translation examples, the parts that correspond to the variables in the source parts and in the target parts of the "sentence translation rules".
3. The system registers the pairs formed by the parts extracted from the SL and the parts extracted from the TL as "part translation rules".
4. The system gives the correct and erroneous frequencies of the "sentence translation rules" to the acquired "part translation rules".
When the pairs (a) and (b) below exist, a part translation rule is produced by using the sentence translation rule. The Greek letter α indicates a variable.
e.g.
(a) Paul は Stéphanie にバラを贈った。(Japanese text, Poru-ha sutefani-ni bara-wo oku-tta)

agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name))
(UNL graph)

(b) Paul は Stéphanie にバラを買った。(Japanese text, Poru-ha sutefani-ni bara-wo ka-tta)

agt(buy(icl>do).@entry.@past, Paul(icl>name))
obj(buy(icl>do).@entry.@past, rose.@pl)
goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))
(UNL graph)

- Translation example:
(Paul は Stéphanie にバラを贈った。; agt(present(icl>do).@entry.@past, Paul(icl>name)) obj(present(icl>do).@entry.@past, rose.@pl) goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- Sentence translation rule:
(Paul は Stéphanie に αを贈った。; agt(present(icl>do).@entry.@past, Paul(icl>name)) α goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- Part translation rule:
(バラ; rose.@pl)

The process for the acquisition of sentence translation rules is as follows:
1. The system selects the "part translation rules" whose source parts are included in the SL of a translation example or in the source parts of "sentence translation rules", and whose target parts are included in the TL of the translation examples or in the target parts of "sentence translation rules".
2. The system acquires new "sentence translation rules" by replacing, in the translation examples or in existing "sentence translation rules", the parts that are identical to those "part translation rules" with variables.
3. The system gives the correct and erroneous frequencies of the "part translation rules" to the acquired "sentence translation rules". [4]

e.g.
Part translation rule:
(バラ; rose.@pl)
Given translation example:
(ポールはステファニーにバラを買った。; agt(buy(icl>do).@entry.@past, Paul(icl>name)) obj(buy(icl>do).@entry.@past, rose.@pl) goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))
Acquisition of the sentence translation rule using the part translation rule:
(ポールはステファニーに αを買った。; agt(buy(icl>do).@entry.@past, Paul(icl>name)) α goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))

Part translation rule:
(ポール; Paul(icl>name))
Acquired sentence translation rule:
(ポールはステファニーにバラを買った。; agt(buy(icl>do).@entry.@past, Paul(icl>name)) obj(buy(icl>do).@entry.@past, rose.@pl) goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))
Recursive acquisition of the sentence translation rule using the part translation rule:
(αはステファニーに αを買った。; α α goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))

There are several types of translation rules. Translation rules without any variable are:
- the translation rule for a whole sentence:
(ポールはステファニーにバラを贈った。: agt(present(icl>do).@entry.@past, Paul(icl>name)) obj(present(icl>do).@entry.@past, rose.@pl) goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- the translation rules for words:
(ポール : Paul(icl>name))
(ステファニー : Stéphanie(icl>name))
(バラ : rose.@pl)
A translation rule with only one variable:
(ポールはαにバラを贈った : agt(present(icl>do).@entry.@past, Paul(icl>name)) α obj(present(icl>do).@entry.@past, rose.@pl))
A translation rule with two variables:
(ポールはαにαを贈った : agt(present(icl>do).@entry.@past, Paul(icl>name)) α α)
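To make the two acquisition processes above concrete, here is a minimal Python sketch of the α-substitution mechanics, using the example pairs above. It is not the GA-ILMT implementation: rules are simply (Japanese string, UNL string) pairs, a part translation rule is obtained by reading off what α covers when a sentence translation rule is laid over a translation example, and a new sentence translation rule is obtained by replacing a known part with α. For simplicity the variable is placed in the argument slot of the obj relation, so that the extracted part is (バラ; rose.@pl) as in the example above; the examples sometimes let α cover a whole relation instead.

# -*- coding: utf-8 -*-
# Schematic sketch of RCL-style rule acquisition over (SL, TL) string pairs.
ALPHA = "α"

def match_alpha(pattern, text):
    # If `pattern` equals `text` except for one occurrence of ALPHA,
    # return the substring that ALPHA stands for; otherwise return None.
    prefix, sep, suffix = pattern.partition(ALPHA)
    if sep and text.startswith(prefix) and text.endswith(suffix) \
            and len(text) >= len(prefix) + len(suffix):
        return text[len(prefix):len(text) - len(suffix)]
    return None

def extract_part_rule(example, sentence_rule):
    # Part translation rule = the (SL, TL) pieces covered by ALPHA when the
    # sentence translation rule is matched against a translation example.
    sl = match_alpha(sentence_rule[0], example[0])
    tl = match_alpha(sentence_rule[1], example[1])
    return (sl, tl) if sl is not None and tl is not None else None

def acquire_sentence_rule(example, part_rule):
    # New sentence translation rule = the translation example with one known
    # part translation rule replaced by ALPHA on both sides.
    sl_part, tl_part = part_rule
    if sl_part in example[0] and tl_part in example[1]:
        return (example[0].replace(sl_part, ALPHA, 1),
                example[1].replace(tl_part, ALPHA, 1))
    return None

example_b = ("ポールはステファニーにバラを買った。",
             "agt(buy(icl>do).@entry.@past, Paul(icl>name)) "
             "obj(buy(icl>do).@entry.@past, rose.@pl) "
             "goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))")

sentence_rule = ("ポールはステファニーにαを買った。",
                 "agt(buy(icl>do).@entry.@past, Paul(icl>name)) "
                 "obj(buy(icl>do).@entry.@past, α) "
                 "goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))")

print(extract_part_rule(example_b, sentence_rule))        # ('バラ', 'rose.@pl')
print(acquire_sentence_rule(example_b, ("ポール", "Paul(icl>name)")))
# -> ('αはステファニーにバラを買った。',
#     'agt(buy(icl>do).@entry.@past, α) obj(buy(icl>do).@entry.@past, rose.@pl) goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))')

In GA-ILMT the acquired rules also carry correct/erroneous frequency counts that are used for rule selection during the feedback process; that bookkeeping is omitted here.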
2 Domain

It is no exaggeration to say that people all over the world are discussing the ecological environment and organic products in the media. Japanese people in particular are sensitive to these questions and engage with them constantly. On the other hand, it is rather difficult for them to obtain information or references on theoretical aspects, commercial developments or citizens' movements in foreign countries if they do not know the language of the information source. This is the key reason for proposing the domain of biology and ecology (the bio-eco domain) as our first target for Japanese generation from UNL graphs.

e.g.

Agriculture biologique
Pourquoi l'agriculture biologique ?
J'ai un jour pris conscience qu'un aliment est quelque chose que j'introduis à l'intérieur de mon corps. Il ne me viendrait pas à l'idée d'introduire du carburant souillé dans ma voiture sans craindre une panne, et pourtant, j'ai pendant de trop nombreuses années introduit dans mon corps des denrées alimentaires comprenant des composés indésirables : additifs, pesticides et autres. Une étude de l'INRA indique qu'un français moyen en ingère près d'1,5 kg par an ! [9]

Japanese translation:
有機農業
なぜ有機農業か。
ある日わたしは食べ物は身体の内部に取り入れるものだと意識した。故障も恐れないで自動車に汚れた燃料を入れるという考えはこないだろう。しかしながら、わたしは多年にわたってのぞましくない成分を含む食料品を体内に摂取してきた。添加物、殺虫剤、など。INRA の研究では平均的フランス人一人がそれらを摂取する量は一年に 1,5kg であると指摘している。

3 Work plan for 18 months

The working steps and timetable are roughly as follows. The duration of the project is 18 months.

I. Corpus development:
1. choice of articles on web sites (500 sentences)
2. translation into Japanese by human experts
3. corpus analysis
II. UW dictionary:
4. list of correspondences between Japanese characters and ASCII codes
5. dictionary description
III. Generation experiments:
6. UNL graphs for the Japanese articles
7. Japanese morphological rules
8. Japanese syntactic rules for noun phrases
9. Japanese syntactic rules for verb phrases
10. Japanese syntactic rules for adjective and adverb phrases
11. Japanese semantic rules
12. Japanese pragmatic rules
IV. Encoding experiments:
13. GA-ILMT preparation
14. learning experiments
15. encoding experiments

MONTH-YEAR      TASK (deconverter)                                            TASK (enconverter)
May 2005        1. choice of articles on web sites (500 sentences)
June 2005       2. translation into Japanese by human experts
July 2005       3. corpus analysis
August 2005
September 2005  4. list of Japanese characters and ASCII codes
October 2005    5. UW dictionary description
November 2005
December 2005   6. UNL graphs for the Japanese articles                       13. GA-ILMT preparation
January 2006
February 2006
March 2006      7. Japanese morphological rules                               14. learning experiments
April 2006
May 2006        8. Japanese syntactic rules for noun phrases
June 2006       9. Japanese syntactic rules for verb phrases
July 2006       10. Japanese syntactic rules for adjective and adverb phrases
August 2006     11. Japanese semantic rules
September 2006  12. Japanese pragmatic rules                                  15. encoding experiments
October 2006

4 Persons concerned

Prof. Christian Boitet, GETA-CLIPS-IMAG & Université Joseph Fourier
Prof. Kenji Araki, Language Media Laboratory, Hokkaido University
Dr Mutsuko Tomokiyo, GETA-CLIPS-IMAG & ENST (to be supported by the proposed grant)
Dr Etienne Blanc, GETA-CLIPS-IMAG

5 Support

We will work at the CLIPS laboratory in France, as part of joint research with the Araki laboratory at Hokkaido University in Japan. We would like to thank the Spanish UNL centre and Professor Jesus Cardeñosa for his kind offer to find support for Dr Tomokiyo in the form of a grant or equivalent means.
The grant requested is 2000 euros/month × 18 months = 36,000 euros; the "Association Champollion" at CLIPS would be entrusted with its management.

6 References

[1] Sérasset G. and Boitet Ch., UNL-French deconversion as transfer & generation from an interlingua with possible quality enhancement through offline human interaction, Proc. of MT Summit VII, Singapore, 1999.
[2] Boitet Ch., GETA's MT methodology and its current development towards personal networking communication and speech translation in the context of the UNL and C-STAR projects, Proc. of PACLING-97, 1997.
[3] http://sig.media.eng.hokudai.ac.jp/~araki/araki_english.html
[4] Echizenya H., Araki K., Momouchi Y., Tochinai K., Study of Practical Effectiveness for Machine Translation Using Recursive Chain-link-type Learning, Proc. of COLING-2002, Taipei, 2002.
[5] Sato S. and Nagao M., Toward Memory-based Translation, Proc. of COLING-90, 1990.
[6] Brown P. F. et al., A Statistical Approach to Machine Translation, Computational Linguistics, vol. 16, no. 2, 1990.
[7] McTait K., Linguistic Knowledge and Complexity in an EBMT System Based on Translation Patterns, Proc. of the Workshop on EBMT, MT Summit VIII, 2001.
[8] http://gohan.imag.fr/unldeco
[9] http://www.eco-bio.info/accueil.html
[10] http://www.stat.go.jp/data/sekai/index.htm
[11] Blanc E., From the UNL hypergraph to GETA's multilevel tree, Proc. of Machine Translation 2000, University of Exeter, 18-21 Oct. 2000, pp. 9.1-9.9, British Computer Society, 2000.
[12] Echizenya H., Araki K., Momouchi Y., Tochinai K., Application of Genetic Algorithms for Example-based Machine Translation Method Using Inductive Learning and Its Effectiveness, Journal of the Information Processing Society of Japan, vol. 37, no. 8, 1996.
[13] Echizenya H., Araki K., Momouchi Y., Tochinai K., Machine Translation Method Using Inductive Learning with Genetic Algorithms, Proc. of COLING-1996, Copenhagen, 1996.

Contents

Introduction
1 Methodology
1.1 Generation of Japanese in the Ariane-G5 environment
1.2 Automatic encoding of Japanese text into UNL graphs
2 Domain
3 Work plan for 18 months
4 Persons concerned
5 Support
6 References