Proposal to build a UNL-Japanese deconverter and a Japanese-UNL enconverter at GETA
Mutsuko Tomokiyo
mutsuko.tomokiyo@imag.fr
GETA-CLIPS-IMAG1) & ENST2)
1) 385 rue de la Bibliothèque, BP53, campus universitaire, 38041, Grenoble cedex 9, France
2) 46 rue Barrault, 75634 Paris, cedex 13, France
Introduction
This document aims to formalize our motivation, first, to build a deconverter of UNL graphs into Japanese
utterances as a final goal. Deconversion into Japanese has already been carried out by the UNL center in Japan,
using the DeCo programming language. We think, however, that the Japanese deconverter is not yet operational
in the way deconverters for other languages are, although about 44.89% of the whole population of Japan are
internet users1. Moreover, it is worth experimenting with several ways of building a deconverter when dealing
with a language like Japanese, which differs widely from European languages in syntax, semantics and
pragmatics. As far as Indo-European languages are concerned, the problems they pose to NLP are quite similar,
and several methods and tools are already being tried by various UNL language centers, for French, Russian,
Italian, and Portuguese. This is the key reason for proposing to build a UNL-Japanese deconverter.
At GETA, we use a machine translation (MT) environment called Ariane-G5, in which R&D on UNL-to-French
deconversion has been carried out since 1997 [1] [2]. More recently, work on a UNL-to-Chinese deconverter has
been started; it is currently conducted by a French senior researcher and a Chinese PhD student. Hence, we will
be able to apply the knowledge accumulated through these works to the processing of Japanese, although the
linguistic phenomena differ among the three languages and the difficulties we will meet arise at different steps
than for those two languages.
Second, we propose to carry out a small experiment in automatic encoding of Japanese into UNL, based on an
inductive learning method. For this, we will develop a corpus containing 500 pairs of Japanese utterances and
UNL graphs, and test the feasibility of automatic encoding into UNL, using an MT system based on the
recursive chain-link (RCL) learning method [4], which has been developed at the Araki laboratory at Hokkaido
University and the Echizenya laboratory at Hokkai Gakuen University in Japan [3].
At present, the encoding (a synonym for "enconversion") of natural language utterances into UNL graphs is
done semi-automatically by human experts for all languages. Hence, producing fully automatic enconversion
methods is an important issue for UNL users, and every member of the UNL society should eventually tackle it.
We describe, in section 1, a method for generating Japanese and a method for automatically encoding
(enconverting) Japanese text into UNL. In section 2, we propose a domain to work on. The working steps and
their timetable are shown in section 3, followed by sections 4 and 5, where we list the persons concerned by this
work and request a grant from the Spanish UNL center.
1 According to a 2002 report from the Statistics Bureau of Japan, the proportion of internet users is 55.14% of the whole
population in the USA, 42.31% in England, 31.38% in France, 35.24% in Italy, and 4.09% in Russia [10].
1 Methodology

1.1 Generation of Japanese on the ARIANE-G5 environment
Ariane-G5 is a generator of MT systems based on a transfer approach, and several software and lingware
components have been added to the original system in order to use it for the UNL-French deconversion.
Deconversion from UNL graphs to a target language is performed in three successive steps on Ariane-G5
as shown in Fig.1: analysis, transfer, and generation.
[Figure: process flow of the Ariane system from source text to target text, built with the engines ATEF, ROBRA,
EXPANS and SYGMOR. ANALYSIS: AM (morphological analysis), AS (structural analysis), AX and AY
(expansive analyses), with an optional interactive disambiguation module. TRANSFER: TL (lexical transfer),
TS (structural transfer), TX and TY (expansive transfers). GENERATION: GX and GY (expansive generations),
GS (syntactic generation), GM (morphological generation).]
Fig. 1 The Ariane system2
UNL graphs embody the results of morphological and structural analysis. In this approach, the UNL graphs
themselves are the input to the transfer phase. The transfer on Ariane-G5, however, consists in tree-to-tree
conversion, so the UNL graphs are first converted into tree structures [11].
e.g.
(a) ポールはステファニーにバラを贈った。(Japanese text, Poru-ha sutefani-ni bara-wo oku-tta)
2 The process flow is taken from [2]. The phases in boxes with broken-line contours are optional processes.
agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name))
(UNL graphs)
The tree structure for the example sentence (a) is obtained from these relations.
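As an illustration only, the following minimal Python sketch (with hypothetical data structures, not the actual
Ariane-G5 conversion described in [11]) builds such a tree by taking the node carrying the @entry attribute as
the root and attaching each UNL relation as a labelled branch:

# Illustrative sketch: build a tree from the UNL relations of example (a),
# rooted at the node that never appears as a dependent (here the @entry node).
from collections import defaultdict

unl = [
    ("agt", "present(icl>do).@entry.@past", "Paul(icl>name)"),
    ("obj", "present(icl>do).@entry.@past", "rose.@pl"),
    ("goal", "present(icl>do).@entry.@past", "Stéphanie(icl>name)"),
]

def to_tree(relations):
    children = defaultdict(list)
    nodes, dependents = set(), set()
    for rel, head, dep in relations:
        children[head].append((rel, dep))
        nodes.update((head, dep))
        dependents.add(dep)
    root = next(n for n in nodes if n not in dependents)
    return root, children

def show(node, children, rel="", depth=0):
    print("  " * depth + (f"{rel} -> {node}" if rel else node))
    for r, dep in children.get(node, []):
        show(dep, children, r, depth + 1)

root, children = to_tree(unl)
show(root, children)
# present(icl>do).@entry.@past
#   agt -> Paul(icl>name)
#   obj -> rose.@pl
#   goal -> Stéphanie(icl>name)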
The transfer phase consists in performing lexical and structural transfer to the target language.
The UW dictionary will give the lexical and structural information necessary for the transfer. The Japanese
UW dictionary entries for the example will be as follows:
present : CAT(CATV),GP2(A),VAL1(GN-wo),VAL2(GN-ni), AUXP(-tta) : 贈る (oku-ru)
rose : CAT(CATN), N(NC) : バラ(bara) [8]
In the example sentence, there is no syntactic disparity between the valency of the Japanese verb (贈る) and
that of the UW (present), hence a simple transfer is done without any problem.
In the generation phases, syntactic and morphological generation are also performed by using the UW dictionary.
The verb “贈る” takes the “wo”-case (VAL1 in the dictionary) and the “ni”-case (VAL2 in the dictionary) for
its constituents, so “を” (wo) and “に” (ni) are added to the corresponding noun phrases as syntactic generation,
and finally the tense information of the sentence is realized as morphological generation. “贈る” is a verb of the
standard conjugation type, but its pronunciation changes in the past-tense form, so the information “AUXP(-tta)”
is given in the dictionary.
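Purely as an illustration (a hypothetical Python sketch with a hypothetical dictionary format; the actual
generation will be written as Ariane-G5 lingware), such an entry could drive syntactic and morphological
generation roughly as follows:

# Illustrative sketch: the UW dictionary entry for "present" gives the case
# particles (VAL1, VAL2) to attach to the argument noun phrases and the
# past-tense ending (AUXP).
uw_dict = {
    "present": {"stem": "贈", "lemma": "贈る", "VAL1": "を", "VAL2": "に", "AUXP": "った"},
}

def generate(uw, agent, obj, goal, past=True):
    entry = uw_dict[uw]
    verb = entry["stem"] + entry["AUXP"] if past else entry["lemma"]  # 贈 + った -> 贈った
    # syntactic generation: attach the case particles to the noun phrases
    return agent + "は" + goal + entry["VAL2"] + obj + entry["VAL1"] + verb + "。"

print(generate("present", "ポール", "バラ", "ステファニー"))
# ポールはステファニーにバラを贈った。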
At the moment there is neither a Japanese UW dictionary nor Japanese linguistic information in Ariane-G5, so
we will start the work by editing the Japanese dictionary. Also, it is at present not possible to display Japanese
characters either in Ariane-G5 or in the UW dictionary, so we will use ASCII codes instead of inputting Japanese
strings directly. We will make a list of ASCII codes corresponding to Japanese characters.
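One possible coding scheme (an assumption on our part; the exact code list is still to be defined) is to write each
Japanese character as its Unicode code point using only ASCII characters, for instance:

# Possible scheme (an assumption; the actual code list is still to be defined):
# each Japanese character is written as its Unicode code point in ASCII.
def to_ascii(text):
    return " ".join("U%04X" % ord(ch) for ch in text)

def from_ascii(codes):
    return "".join(chr(int(c[1:], 16)) for c in codes.split())

print(to_ascii("バラ"))           # U30D0 U30E9
print(from_ascii("U30D0 U30E9"))  # バラ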
1.2 Automatic encoding of Japanese text into UNL graphs
Almost all learning-type MT systems require large amounts of translation examples [5,6,7]. To overcome this
problem, a recursive chain-link (RCL) learning method has been proposed [3] and applied to the GA-ILMT
(Machine Translation using Inductive Learning with Genetic Algorithms) system at the Araki laboratory [12].
GA-ILMT is a machine translation approach using inductive learning with genetic algorithms, which imitate the
evolutionary process of repeated generational replacement to adapt to the environment [13].
When English-Japanese MT is performed on the system, the user first inputs a source sentence in English;
second, in the translation process, the system produces several candidate translation results; third, the user
proofreads the translated sentence; fourth, in the feedback process, the system determines the fitness values and
performs the selection of translation rules. In the learning process, new translation examples are automatically
produced by “crossover and mutation” operations, and various translation rules are extracted from the
translation examples by inductive learning [13]. The process flow is shown in Figure 2.
[Figure: process flow of GA-ILMT. The source sentence goes through the translation process; the translation
result is proofread by the user into a correct translation result; the feedback process and the learning process
then update the dictionary of translation rules.]
Fig.2 Process flow [4]
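The loop can be summarised by the following skeleton (hypothetical Python code; rule application is a naive
string replacement here, and the fitness handling and the genetic crossover/mutation operators are only
indicated, see [12] and [13] for the actual system):

# Skeleton of the GA-ILMT loop of Fig. 2 (hypothetical code).
rules = {"バラ": ["rose.@pl", 0.0]}    # source part -> [target part, fitness]
examples = []                          # proofread (source, target) pairs

def translation_process(source):
    # produce a candidate translation by applying the stored rules
    result = source
    for src, (tgt, _) in rules.items():
        result = result.replace(src, tgt)
    return [result]

def feedback_process(corrected):
    # determine fitness values: reward rules whose target part appears
    # in the proofread result, penalise the others
    for src in rules:
        tgt, fit = rules[src]
        rules[src][1] = fit + (1 if tgt in corrected else -1)

def learning_process(source, corrected):
    # store the proofread pair; new translation examples would be produced
    # here by crossover and mutation, and new rules by inductive learning
    examples.append((source, corrected))

source = "ポールはステファニーにバラを贈った。"
candidate = translation_process(source)[0]
corrected = candidate                  # in reality: proofread by the user
feedback_process(corrected)
learning_process(source, corrected)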
The advantage of the RCL learning method consists in acquiring new translation rules from sparse data by
using other, already acquired translation rules. In fact, it is very difficult to obtain a large number of Japanese
utterance-UNL graph pairs. We think, however, that in this way about 500 pairs of Japanese utterances and their
UNL graphs are sufficient for experiments in encoding Japanese into UNL graphs.
The mechanism of recursive chain-link-type (RCL) learning is the following: RCL-type learning on GA-ILMT
produces “part translation rules” and “sentence translation rules” by turns.
The details of the process of acquisition of “part translation rules” are as follows:
1. The system selects translation examples that have parts in common with the “sentence translation rules”.
2. The system extracts, from the source language (SL, here Japanese utterances) and the target language (TL,
here UNL graphs) of the translation examples, the parts that correspond to the variables in the source parts
and in the target parts of the “sentence translation rules”.
3. The system registers the pairs of parts extracted from the SL and from the TL as “part translation rules”.
4. The system gives the correct and erroneous frequencies of the “sentence translation rules” to the acquired
“part translation rules”.
When the pairs (a) and (b) exist, a part translation rule is produced by using the sentence translation rule. The
Greek character α indicates a variable.
e.g.
(a) Paul は Stéphanie にバラを贈った。(Japanese text, Poru-ha sutefani-ni bara-wo oku-tta)
agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name))
UNL graphs
(b) Paul は Stéphanie にバラを買った。(Japanese text, Po-ru-ha sutefani-ni bara-wo ka-tta)
agt(buy(icl>do).@entry.@past, Paul(icl>name))
obj(buy(icl>do).@entry.@past, rose.@pl)
goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))
UNL graphs
- Translation example :
( Paul は Stéphanie にバラを贈った。; agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- Sentence translation rule :
(Paul は Stéphanie に αを贈った。; agt(present(icl>do).@entry.@past, Paul(icl>name))
α
goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- Part translation rule : (バラ; rose.@pl)
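This extraction step can be sketched as follows (hypothetical Python code, see [4] for the actual system; the
variable in the target side is read here as standing for the argument of the obj relation, so that the extracted pair
is exactly (バラ; rose.@pl)):

# Sketch of part translation rule acquisition (hypothetical code, see [4]):
# the parts of a translation example matched by the variable α in a sentence
# translation rule become a new part translation rule.
import re

ALPHA = "α"

def extract_part(pattern, text):
    # return the substring of text matched by α in pattern, if any
    regex = re.escape(pattern).replace(re.escape(ALPHA), "(.+?)")
    m = re.fullmatch(regex, text, re.DOTALL)
    return m.group(1) if m else None

example = ("PaulはStéphanieにバラを贈った。",
           "agt(present(icl>do).@entry.@past, Paul(icl>name)) "
           "obj(present(icl>do).@entry.@past, rose.@pl) "
           "goal(present(icl>do).@entry.@past, Stéphanie(icl>name))")

sentence_rule = ("PaulはStéphanieにαを贈った。",
                 "agt(present(icl>do).@entry.@past, Paul(icl>name)) "
                 "obj(present(icl>do).@entry.@past, α) "
                 "goal(present(icl>do).@entry.@past, Stéphanie(icl>name))")

part_rule = (extract_part(sentence_rule[0], example[0]),
             extract_part(sentence_rule[1], example[1]))
print(part_rule)   # ('バラ', 'rose.@pl')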
The details of the process of acquisition of “sentence translation rules” are as follows:
1. The system selects the “part translation rules” whose source parts are included in the SL of the translation
examples or in the source parts of “sentence translation rules”, and whose target parts are included in the TL
of the translation examples or in the target parts of “sentence translation rules”.
2. The system acquires new “sentence translation rules” by replacing, in the translation examples or the
“sentence translation rules”, the parts that are the same as the “part translation rules” with variables.
3. The system gives the correct and erroneous frequencies of the “part translation rules” to the acquired
“sentence translation rules” [4].
e.g.
Part translation rule : (バラ; rose.@pl)
Given translation example
(ポール はステファニーにバラを買った。;
agt(buy(icl>do).@entry.@past, Paul(icl>name))
obj(buy(icl>do).@entry.@past, rose.@pl)
goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))
Acquisition of the sentence translation rule using the part translation rule :
(ポール はステファニーに αを買った。;
agt(buy(icl>do).@entry.@past, Paul(icl>name))
α
goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))
Part translation rule : (ポール ; Paul(icl>name))
Recursive acquisition of the sentence translation rule from the acquired sentence translation rule, using this part
translation rule :
(αはステファニーに αを買った。;
α
α
goal(buy(icl>do).@entry.@past, Stéphanie(icl>name)))
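This replacement step can be sketched as follows (hypothetical Python code, see [4]; here the variable replaces
exactly the strings given in the part translation rules, whereas the display above abbreviates whole UNL
relations as α):

# Sketch of sentence translation rule acquisition (hypothetical code, see [4]):
# the parts matching a part translation rule are replaced by the variable α
# on both the source side and the target side of a translation example.
ALPHA = "α"

def acquire_sentence_rule(example, part_rules):
    src, tgt = example
    for part_src, part_tgt in part_rules:
        if part_src in src and part_tgt in tgt:
            src = src.replace(part_src, ALPHA)
            tgt = tgt.replace(part_tgt, ALPHA)
    return src, tgt

example = ("ポールはステファニーにバラを買った。",
           "agt(buy(icl>do).@entry.@past, Paul(icl>name)) "
           "obj(buy(icl>do).@entry.@past, rose.@pl) "
           "goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))")

part_rules = [("バラ", "rose.@pl"), ("ポール", "Paul(icl>name)")]

print(acquire_sentence_rule(example, part_rules))
# ('αはステファニーにαを買った。',
#  'agt(buy(icl>do).@entry.@past, α) obj(buy(icl>do).@entry.@past, α) '
#  'goal(buy(icl>do).@entry.@past, Stéphanie(icl>name))')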
There are several types of translation rules.
The translation rule without any variable is :
- The translation rule for the SL
(ポール はステファニーにバラを贈った。: agt(present(icl>do).@entry.@past, Paul(icl>name))
obj(present(icl>do).@entry.@past, rose.@pl)
goal(present(icl>do).@entry.@past, Stéphanie(icl>name)))
- The translation rules for words
(ポール : Paul(icl>name))
(ステファニー: Stéphanie(icl>name))
(バラ: rose.@pl)
- The translation rule with only one variable
(ポールはαにバラを贈った: agt(present(icl>do).@entry.@past, Paul(icl>name))
α
obj(present(icl>do).@entry.@past, rose.@pl))
- The translation rule with two variables
(ポールはαにαを贈った。 : agt(present(icl>do).@entry.@past, Paul(icl>name)) α α)
2 Domain
It is not too much to say that people all over the world are talking about the ecological environment and
bio-products in the media. Japanese people in particular are sensitive to these topics and constantly engage with
the related problems. On the other hand, it is somewhat difficult for them to obtain information or references on
theoretical aspects, commercial development or citizens' movements in foreign countries if they do not know the
languages of the information sources. This is the key reason for proposing the domain of biology and ecology
(bio-eco domain) as our first approach to Japanese generation from UNL graphs.
e.g.
Agriculture biologique
Pourquoi l'agriculture biologique ?
J'ai un jour pris conscience qu'un aliment est quelque chose que j'introduis à l'intérieur
de mon corps. Il ne me viendrait pas à l'idée d'introduire du carburant souillé dans ma voiture sans
craindre une panne, et pourtant, j'ai pendant de trop nombreuses années introduit dans mon corps
des denrées alimentaires comprenant des composés indésirables : additifs, pesticides et autres. Une
étude de l'INRA indique qu'un français moyen en ingère près d'1,5 kg par an ! [9]
translation : 有機農業
なぜ有機農業か。
ある日わたしは食べ物は身体の内部に取り入れるものだと意識した。故障も恐れないで自動車
に汚れた燃料を入れるという考えはこないだろう。しかしながら、わたしは多年にわたってのぞまし
くない成分を含む食料品を体内に摂取してきた。添加物、殺虫剤、など。INRA の研究では平均的フ
ランス人一人がそれらを摂取する量は一年に 1,5kg であると指摘している。
3 Work plan for 18 months
The working steps and the timetable are roughly shown below. The duration of the project is 18 months.
I. Corpus development :
1. choice of articles on Web sites (500 sentences)
2. translation by human experts into Japanese
3. corpus analysis
II. UW Dictionary :
4. corresponding list of Japanese characters and ASCII codes
5. dictionary description
III. Generation experimentations :
6. UNL graphs for Japanese articles
7. Japanese morphological rules
8. Japanese syntactic rules for noun phrases
9. Japanese syntactic rules for verb phrases
10. Japanese syntactic rules for adjective and adverb phrases
11. Japanese semantic rules
12. Japanese pragmatic rules
IV. Encoding experimentations :
13. GA-ILMT preparation
14. learning experimentations
15. encoding experimentations
MONTH-YEAR | TASK (deconverter) | TASK (enconverter)
May 2005 | 1. choice of articles on Web sites (500 sentences) |
June 2005 | 2. translation by human experts into Japanese |
July 2005 | 3. corpus analysis |
August-September 2005 | 4. corresponding list of Japanese characters and ASCII codes |
October 2005 | 5. UW dictionary description |
November-December 2005 | 6. UNL graphs for Japanese articles | 13. GA-ILMT preparation
January-March 2006 | 7. Japanese morphological rules | 14. learning experimentations
April-May 2006 | 8. Japanese syntactic rules for noun phrases |
June 2006 | 9. Japanese syntactic rules for verb phrases |
July 2006 | 10. Japanese syntactic rules for adjective and adverb phrases |
August 2006 | 11. Japanese semantic rules |
September-October 2006 | 12. Japanese pragmatic rules | 15. encoding experimentations
4 Persons concerned
Prof. Christian Boitet, GETA-CLIPS-IMAG & Université Joseph Fourier
Prof. Kenji Araki, Language Media Laboratory, Hokkaido University
Dr Mutsuko Tomokiyo, GETA-CLIPS-IMAG & ENST (to be supported by the proposed grant)
Dr Etienne Blanc, GETA-CLIPS-IMAG
5 Support
We will work at the CLIPS laboratory in France, as part of a joint research effort with the Araki laboratory at
Hokkaido University in Japan.
We would like to thank the Spanish UNL centre and Professor Jesus Cardeñosa for his kind offer to find
support for Dr. Tomokiyo in the form of a grant or equivalent means.
The grant requested is 2000 euros/month * 18 months = 36,000 euros, and the “Association Champollion” at
CLIPS is entrusted with its management.
6 References
[1] Sérasset G. and Boitet, Ch., UNL-French deconversion as transfer & generation from an interlingua with possible
quality enhancement through offline human interaction, MT-SUMMIT VII, Singapore, 1999
[2] Boitet Ch., GETA’s MT methodology and its current development towards personal networking communication
and speech translation in the context of the UNL and C-STAR projects, Proc. of PACLING-97, 1997
[3] http://sig.media.eng.hokudai.ac.jp/~araki/araki_english.html
[4] Echizenya H., Araki K., Momouchi Y., Tochinai K., Study of Practical Effectiveness for Machine Translation
Using Recursive Chain-link-type Learning, Proc. of COLING-2002, Taipei, 2002.
[5] Sato S., and Nagao, M., Towards Memory-based Translation, In proc. of the Coling90, 1990
[6] Brown P.F. et al., A Statistical Approach to Machine Translation, Computational Linguistics, vol. 16, No. 2, 1990
[7] McTait K., Linguistic Knowledge and Complexity in an EBMT System Based on Translation Patterns, Proc. of the
Workshop on EBMT, MT Summit VIII, 2001
[8] http://gohan.imag.fr/unldeco
[9] http://www.eco-bio.info/accueil.html
[10] http://www.stat.go.jp/data/sekai/index.htm
[11] Blanc E., From the UNL hypergraph to GETA's multilevel tree, Proc. of Machine Translation, University of
Exeter, 18-21 Oct., pp. 9.1-9.9, British Computer Society, 2000.
[12] Echizenya H., Araki K., Momouchi Y., Tochinai K., Application of Genetic Algorithms for Example-based
Machine Translation Method Using Inductive Learning and Its Effectiveness, Transactions of the Information
Processing Society of Japan, vol. 37, No. 8, 1996.
[13] Echizenya H., Araki K., Momouchi Y., Tochinai K., Machine Translation Method Using Inductive Learning with
Genetic Algorithms, Proc. of COLING-1996, Copenhagen, 1996.
Contents
Introduction
1 Methodology
1.1 Generation of Japanese on the ARIANE-G5 environment
1.2 Automatic encoding of Japanese text into UNL graphs
2 Domain
3 Work plan for 18 months
4 Persons concerned
5 Support
6 References