Using Example-based Machine Translation for
English – Vietnamese Translation
Nguyen Minh Quang, Tran Dang Hung
Software Engineering Department, Faculty of Information Technology
Hanoi National University of Education
{quangnm, hungtd}@hnue.edu.vn
Abstract
Recently, there have been significant advances
in machine translation in Vietnam. Most
approaches are based on a combination of
grammar analysis with a statistics-based or a
rule-based method. However, their results are
still far from human expectations. In this paper,
we introduce a new approach based on
example-based machine translation. The idea of
this method is to use aligned pairs of English
and Vietnamese sentences together with an
algorithm that retrieves, from this data resource,
the English sentence most similar to the input
sentence. Then we produce a translation from
the retrieved sentence. We applied the method
to English-Vietnamese translation using a
bilingual corpus of 6,000 sentence pairs. The
system achieves feasible translation ability and
good performance. Compared to other methods
applied in English-Vietnamese translation, our
method can achieve higher translation quality.
I. Introduction
Machine translation has been studied and
developed for many decades. For Vietnamese,
there are some projects that have proposed several
approaches. Most of them use a system based on
analyzing and mapping grammar structure
(e.g. rule-based and corpora-based approaches).
Among them, the rule-based approach is the main
direction in this field nowadays, with a bilingual
corpus and carefully built grammatical rules [7].
One of the biggest difficulties in rule-based
translation, as well as in other methods, is data
resources. An important resource required for
translation is the thesaurus, which takes a lot of
effort and work to build [9]. This dataset,
however, does not yet meet the requirements. In
addition, almost all traditional methods also
require knowledge about the languages involved,
so it takes time to build a system for new
languages [5, 6].
Example-Based Machine Translation
(EBMT) is a newer method, which relies on large
corpora and tries, to some extent, to reject
traditional linguistic notions [5]. EBMT systems
are attractive in that output translations should be
more sensitive to context than those of rule-based
systems, i.e. of higher quality in appropriateness
and idiomaticity. Moreover, EBMT requires a
minimum of prior knowledge beyond the corpus
that makes up the example set, and is therefore
quickly adapted to many language pairs [5].
EBMT has been applied successfully in
Japan and the United States in some specific
fields [1]. In Japan, a system was built that
achieves high-quality translation and efficient
processing in the travel expression domain [1].
For Vietnamese, however, there is no research
following this method, even though applying it
to English-Vietnamese translation does not
require many resources or much linguistic
knowledge. We only have an English-Vietnamese
corpus at Ho Chi Minh National University, a
significant data resource with 40,000 pairs of
sentences (in Vietnamese and English) and about
5,500,000 words [8]. We already have an English
thesaurus and an English-Vietnamese dictionary.
For the set of aligned corpora, we have prepared
5,500 items for this research.
In this paper, we use EBMT to build a
system for English-Vietnamese translation. We
apply the graph-based method of [1] to the
Vietnamese language. In this paradigm, we have
a set whose items are pairs of sentences: one in
the source language and one in the target
language. Given an input sentence, we retrieve
from the set the item whose source sentence is
most similar to the input. Finally, from the
example and the input sentence, we adapt the
example to produce a final sentence in the target
language. Unfortunately, we do not have a
Vietnamese thesaurus, so we propose some
solutions for this problem. In addition, this paper
proposes a method to adapt the example sentence
to produce the final translation.
1. EBMT overview:
There are three components in a conventional
example-based system:
- Matching fragment component.
- Word alignment.
- Combination of the input and the example
sentence to produce the final target sentence.
For example:
(1) He buys a book on international politics.
(2) a. He buys a notebook.
    b. Anh ấy mua một quyển sách.
(3) Anh ấy mua một quyển sách về chính trị thế
giới.
With the input sentence (1), the translation (3)
can be produced by matching and adapting
(2a, b).
One of the main advantages of this method is
that we can easily improve the quality of
translation by enlarging the example set: the more
items we add, the better the results. It is
especially useful in a specific domain, because
the forms of the sentences in such domains are
limited. For example, we can use it to translate
product manuals, weather forecasts, or medical
diagnoses.
The difficulty in applying EBMT to Vietnamese
is that there is no WordNet for Vietnamese, so we
propose some new solutions to this problem.
We build a system with three steps:
- Form the set of example sentences; the result is
a set of graphs.
- Retrieve the example sentence most similar to
the input sentence. Given an input sentence,
using the edit-distance measure, the system finds
the sentences most similar to it. Edit distance is
used for fast approximate matching between
sentences: the smaller the distance, the greater
the similarity between sentences (a sketch of this
measure is shown below).
- Adjust the gap between the example and the
input.
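As an illustration of the measure used in the second step, the following is a minimal sketch of word-level edit distance; the function name and tokenization are ours, not part of the original system.

```python
def edit_distance(src_words, tgt_words):
    """Word-level Levenshtein distance between two token lists."""
    n, m = len(src_words), len(tgt_words)
    # dist[i][j] = cost of aligning the first i source words with the first j target words
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                       # i deletions
    for j in range(m + 1):
        dist[0][j] = j                       # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if src_words[i - 1] == tgt_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # exact match / substitution
    return dist[n][m]

# Distance between the input sentence (1) and example (2a) from Section 1:
print(edit_distance("he buys a notebook".split(),
                    "he buys a book on international politics".split()))  # -> 4
```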
2. Data resources:
We use three data resources:
- Bilingual corpus: the set of example sentences.
It consists of pairs of sentences, and each
sentence is represented as a word sequence.
Enlarging this set improves the quality of
translation.
- Thesaurus: a list of words showing similarities,
differences, dependencies, and other relationships
to each other.
- Bilingual dictionary: we used the popular
English-Vietnamese dictionary file provided by
the Socbay company.
3. Building the graph of the example set:
The sentences are word sequences. We divide
the words into two groups:
- Functional words: functional words (also called
grammatical words or synsemantic words) are
words that have little lexical meaning or have
ambiguous meaning, but instead serve to express
grammatical relationships with other words within
a sentence, or specify the attitude or mood of the
speaker.
- Content words: words that are not function
words are called content words (or open-class
words, or lexical words); these include nouns,
verbs, adjectives, and most adverbs, although
some adverbs are function words (e.g., then and
why).
We partition the example set into subsets. Each
subset contains sentences with the same number
of content words and the same number of
functional words.
Based on this division, we build a group of graphs
– word graphs:
- They are directed graphs with a start node and a
goal node. They consist of nodes and edges; each
edge is labeled with a word and has its own
source node and destination node.
- Each graph represents one subset, i.e. the
sentences with the same number of content words
and the same number of functional words.
- Each path from the start node to the goal node
represents a candidate sentence. To optimize the
system, we have to minimize the size of the word
graph: common word sequences in different
sentences share the same edges.
Figure 1: Example of Word Graph
The word graphs have to be optimized so that the
number of nodes is minimal. We use the standard
techniques for converting and minimizing finite
state automata [3, 4].
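As a rough illustration, the sketch below (our own, not the paper's implementation) builds such a word graph as a prefix-sharing automaton; full minimization as in [3, 4] would additionally merge common suffixes.

```python
from collections import defaultdict

class WordGraph:
    """Minimal sketch of a word graph for one subset of example sentences.
    Nodes are integers and edges are labeled with words; this version only
    shares common prefixes, whereas the paper also minimizes the graph like a
    finite state automaton so that common suffixes share edges too."""

    def __init__(self):
        self.start = 0
        self.next_id = 1
        self.edges = defaultdict(dict)   # edges[node][word] = destination node
        self.goals = set()               # nodes where an example sentence ends

    def add_sentence(self, words):
        node = self.start
        for w in words:
            if w not in self.edges[node]:
                self.edges[node][w] = self.next_id
                self.next_id += 1
            node = self.edges[node][w]
        self.goals.add(node)

# One graph would be built per subset (same number of content/functional words).
graph = WordGraph()
for s in ["he buys a notebook", "he buys a book", "she buys a book"]:
    graph.add_sentence(s.split())
```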
After preparing all resources for this method, we
execute its two main steps: example retrieval and
adaptation.
4. Example retrieval:
We use the A* search algorithm to retrieve
the most similar sentences from the word graph.
The result of matching two word sequences is a
sequence of substitutions, deletions and
insertions. The search process in a word graph
finds the least distance between the input
sentence and all the candidate sentences
represented in the graph.
As a result, the matching sequence of a path is
expressed as records, each consisting of a label
and one or two words:
- Exact match: E(word)
- Substitution: S(word, word)
- Deletion: D(word)
- Insertion: I(word)
For example, the matching sequence between the
input sentence We close the door and the example
She closes the teary eyes is:
S(“She”, “We”) – E(“close”) – E(“the”) –
I(“teary”) – S(“eyes”, “door”).
The problem here is that we have to pick the
sentence with the least distance to the input
sentence. We first compare the number of
E-records in each matching sequence, then the
number of S-records, and so on.
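The paper retrieves these records via A* search over the word graph; as a simplified stand-in, the sketch below recovers an E/S/D/I matching sequence from a plain edit-distance table (the helper name is ours).

```python
def matching_sequence(example_words, input_words):
    """Recover E/S/D/I records aligning an example sentence with the input.
    A sketch built on a plain edit-distance table, not the paper's A* search
    over the word graph."""
    n, m = len(example_words), len(input_words)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if example_words[i - 1] == input_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + sub)
    # Backtrace from the bottom-right corner to recover the records.
    records, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and example_words[i - 1] == input_words[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            records.append(("E", example_words[i - 1]))           # exact match
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            records.append(("S", example_words[i - 1], input_words[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            records.append(("D", example_words[i - 1]))           # only in the example
            i -= 1
        else:
            records.append(("I", input_words[j - 1]))             # only in the input
            j -= 1
    return list(reversed(records))

# The example used later in Section 5.3:
print(matching_sequence("this computer has been bought by him".split(),
                        "this nice book has been bought".split()))
# -> E(this) I(nice) S(computer, book) E(has) E(been) E(bought) D(by) D(him)
```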
5. Adaptation:
From the retrieved example, we adapt it to
produce the final target-language sentence for the
input sentence by insertions, substitutions and
deletions. To find the meaning of English words,
we use a morphological method.
5.1. Substitution, deletion and exact match:
We find the right position for the word in the
final sentence for substitution, deletion and
exact-match records. For deletion we do nothing,
but we have to find the meaning of the words in
the substitution and exact-match records. Two
problems arise:
- A word can have several different meanings;
which one should be chosen?
- Words in the dictionary are all in base form,
while they may appear in many other forms in the
input sentence.
We solve this problem step by step. First, we find
the part of speech of the word (noun, verb,
adverb, ...) in the sentence; we use the Penn
Treebank tagging system to specify the form of
each word. Second, based on the form of the
word, we look the word up in the dictionary:
If the word is a plural noun (NNS):
- If it ends with “CHES”, we try deleting “ES”
and deleting “CHES”; when the deletion yields a
base form, we find its meaning in the dictionary.
Otherwise, it is treated as a specific noun.
- If it ends with “XES” or “OES”, we delete
“XES” or “OES” and find the meaning.
- If it ends with “IES”: replace “IES” with “Y”.
- If it ends with “VES”: replace “VES” with “F”
or “FE”.
- If it ends with “ES”: replace “ES” with “S”.
- If it ends with “S”: delete “S”.
After finding the meaning of the plural noun, we
add “những” before its meaning.
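A minimal sketch of how these plural-noun rules could be implemented as candidate generation plus dictionary lookup; the function names and the `en_vi_dict` argument are our own illustration, not the paper's code.

```python
def plural_candidates(word):
    """Candidate base forms for a plural noun (NNS), following the rules above."""
    w = word.lower()
    cands = []
    if w.endswith("ches"):
        cands += [w[:-2], w[:-4]]               # delete "ES" or delete "CHES"
    if w.endswith(("xes", "oes")):
        cands += [w[:-2], w[:-3]]               # delete "ES" or "XES"/"OES"
    if w.endswith("ies"):
        cands.append(w[:-3] + "y")              # replace "IES" with "Y"
    if w.endswith("ves"):
        cands += [w[:-3] + "f", w[:-3] + "fe"]  # replace "VES" with "F" or "FE"
    if w.endswith("es"):
        cands.append(w[:-2] + "s")              # replace "ES" with "S"
    if w.endswith("s"):
        cands.append(w[:-1])                    # delete "S"
    return cands

def translate_plural(word, en_vi_dict):
    """Look each candidate up in the bilingual dictionary; 'những' marks plurality."""
    for base in plural_candidates(word):
        if base in en_vi_dict:
            return "những " + en_vi_dict[base]
    return word  # fall back: keep the word as-is (treated as a specific noun)

print(translate_plural("books", {"book": "quyển sách"}))  # -> những quyển sách
```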
If the word is a gerund:
- Delete “ING” at the end of the word. We try two
cases: first, the word without “ING”; second, the
word without “ING” but with “IE” added at the
end.
If the word is VBP:
- If the word is “IS”: it is “TO BE”. If it ends with
“IES”: replace “IES” with “Y”.
- If it ends with “SSES”: erase “ES”.
- If it ends with “S”: erase “S”.
If the word is in past participle or past form:
- Check whether the word is included in the list of
irregular verbs. If it is, we use the infinitive form
to find the meaning. The list of irregular verbs is
stored as a red-black tree to make the search
easier and faster.
- If it ends with “IED”: erase “IED”.
- If it ends with “ED”: check the two letters just
before “ED”; if they are identical, we erase the
last three letters of the word, otherwise we erase
“ED”.
If the word is in present continuous form, we find
the word in the same way as for gerunds. After
that, we add “đang” after the meaning.
- If the word is JJS (superlative adjective): delete
the last three or four letters and find the meaning
in the dictionary.
After the base form of a word is found, we use the
bilingual dictionary to look up its meaning.
The problem is that, once we reach the base form
of a word, it may have many meanings and we
have to choose the right one. In our experiment,
we take the first meaning in the bilingual
dictionary.
5.2. Insertion:
The problem here is that we do not know the
exact position at which to put the Vietnamese
meaning. If we simply use the position of the
insertion record in the matching sequence, the
final Vietnamese sentence will be of low quality.
We use ideas from rule-based machine translation
to solve this problem: for some specific phrases,
we can find a better position instead of following
the order of records.
First, a link grammar system parses the
grammatical structure of the sentence. The Link
Grammar Parser is a syntactic parser of English,
based on link grammar, an original theory of
English syntax. Given a sentence, the system
assigns it a syntactic structure, which consists of a
set of labeled links connecting pairs of words. The
parser also produces a "constituent"
representation of a sentence (showing noun
phrases, verb phrases, etc.). For example:

             +-----O-----+
 +-D--+---S--+     +--D--+
 |    |      |     |     |
The  boy  painted  a  picture

From the grammatical structure of the sentence,
we find the English phrases whose word order has
to be changed when translating into Vietnamese.
For example, for the noun phrase “nice book”,
with the two I-records I(nice) and I(book), we
used to translate it as “hay quyển sách” instead of
“quyển sách hay”. With link grammar, we know
the exact order in which to translate. Some
phrases to process are listed in Table 1.
Table 1: Some phrases to process with Link Grammar

1. Noun phrase: POS(1, 2) = ({JJ}, {NN}). Reorder: ({NN}, {JJ})
2. Noun phrase: POS(1, 2, 3) = ({DT}, {JJ}, {NN}) && word1 = this, that, these, those.
   Reorder: ({NN}, {JJ}, {DT})
3. Noun phrase: POS(1, 2) = ({NN1}, {NN2}). Reorder: ({NN2}, {NN1})
4. Noun phrase: POS(1, 2) = ({PRP$}, {NN}). Reorder: ({NN}, {PRP$})
5. Noun phrase: POS(1, 2, 3) = ({JJ1}, {JJ2}, {NN}). Reorder: ({NN}, {JJ2}, {JJ1})
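As a rough sketch, the rules in Table 1 could be encoded as POS patterns mapped to a new word order; the data structure and function below are our own illustration (rule 2's extra condition on word 1 being this/that/these/those is noted but not enforced here).

```python
# Reordering rules from Table 1: a POS pattern and the new order of positions.
REORDER_RULES = [
    (("JJ", "NN"), (1, 0)),            # nice book      -> book nice
    (("DT", "JJ", "NN"), (2, 1, 0)),   # this nice book -> book nice this
                                       # (only when word 1 is this/that/these/those)
    (("NN", "NN"), (1, 0)),            # noun-noun compound
    (("PRP$", "NN"), (1, 0)),          # my book        -> book my
    (("JJ", "JJ", "NN"), (2, 1, 0)),   # nice old book  -> book old nice
]

def reorder_phrase(words, tags):
    """Reorder a noun phrase according to the first matching rule."""
    for pattern, order in REORDER_RULES:
        if tuple(tags) == pattern:
            return [words[i] for i in order]
    return words

print(reorder_phrase(["this", "nice", "book"], ["DT", "JJ", "NN"]))
# -> ['book', 'nice', 'this'], i.e. "quyển sách hay này" after dictionary lookup
```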
5.3. Example:
Input sentence: This nice book has been bought
Example retrieval: the most similar example found
for the input sentence is:
This computer has been bought by him.
Sequence of records:
E(“This”) – I(“nice”) – S(“computer”, “book”)
– E(“has”) – E(“been”) – E(“bought”) –
D(“by”) – D(“him”).
With link grammar, there is a noun phrase
“This nice book” within the sentence,
corresponding to the records E(“This”), I(“nice”),
S(“computer”, “book”) respectively. We reorder
the sequence:
S(“computer”, “book”) – I(“nice”) – E(“This”)
– E(“has”) – E(“been”) – E(“bought”) –
D(“by”) – D(“him”).
Based on the new record sequence and the
example, the adaptation phase proceeds as
follows:
- Exact match: keep the order and the meaning of
the word.
“” – “” – “” – “này” – “được” – “mua” – “” –
“” – “”
- Substitution: find the meaning of the word from
the input sentence and replace the word in the
example with it.
“Quyển sách” – “” – “” – “này” – “được” –
“mua” – “” – “” – “”
- Deletion: simply erase the word from the example.
“Quyển sách” – “” – “” – “này” – “được” –
“mua” – “” – “” – “”
- Insertion: we now have the right order of
records, so we simply find the meaning of the
word in each insertion record and place it at the
record's position in the sequence.
“Quyển sách” – “hay” – “” – “này” – “được” –
“mua” – “” – “” – “”.
After these four adaptation steps, we obtain the
final sentence: “Quyển sách hay này được mua”.
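To make the four adaptation steps concrete, here is a minimal sketch that applies a reordered record sequence to produce the Vietnamese output; `example_vi` and `lookup` stand in for the aligned example translation and the morphological/dictionary lookup of Section 5.1, and the sample data is hand-written for this illustration.

```python
def adapt(records, example_vi, lookup):
    """Apply reordered E/S/D/I records to build the target sentence."""
    out = []
    for rec in records:
        kind = rec[0]
        if kind == "E":          # exact match: keep the example's translation
            out.append(example_vi[rec[1]])
        elif kind == "S":        # substitution: translate the input word instead
            out.append(lookup(rec[2]))
        elif kind == "I":        # insertion: translate the input word, keep its slot
            out.append(lookup(rec[1]))
        # "D": deletion, emit nothing
    return " ".join(w for w in out if w)

example_vi = {"this": "này", "has": "", "been": "được", "bought": "mua"}
lookup = {"book": "quyển sách", "nice": "hay"}.get
records = [("S", "computer", "book"), ("I", "nice"), ("E", "this"),
           ("E", "has"), ("E", "been"), ("E", "bought"), ("D", "by"), ("D", "him")]
print(adapt(records, example_vi, lookup))   # -> quyển sách hay này được mua
```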
6. Evaluation:
6.1. Experimental Conditions:
We manually built an English-Vietnamese
corpus of 6,000 sentence pairs.
To evaluate translation quality, we employed
a subjective measure. Each translation result was
graded into one of four ranks by a bilingual human
translator who is a native speaker of Vietnamese.
The four ranks were:
A: Perfect. No problem with either grammar
or information; translation quality is nearly equal
to that of a human translator.
B: Fair. The translation is easy to understand,
but there are some grammatical mistakes or some
trivial information is missing.
C: Acceptable. The translation is broken but
can be understood with effort.
D: Nonsense. Important information was
translated incorrectly.
The English-Vietnamese dictionary used
includes 70,000 words. To optimize the
processing time, a threshold is used to limit the
result set of the example retrieval phase. Table 2
shows the thresholds used to optimize the example
retrieval phase for input sentences shorter than
30 words. If the length of the input sentence is
greater than or equal to 30, the threshold is 8.
Table 2: Values of the threshold

Length of sentence (words)   0–5   5–10   10–15   15–30
Threshold                     2     3      4       6
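For completeness, a small sketch of the length-dependent threshold in Table 2 (the handling of the overlapping range endpoints is our own reading of the table):

```python
def retrieval_threshold(sentence_length):
    """Threshold used to limit the example-retrieval result set (Table 2)."""
    if sentence_length <= 5:
        return 2
    if sentence_length <= 10:
        return 3
    if sentence_length <= 15:
        return 4
    if sentence_length < 30:
        return 6
    return 8   # sentences of 30 words or more
```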
6.2. Performance:
For the experiment, we created two test sets: a
set of random sentences with simple grammatical
structure and a set of 50 sentences edited from
the training set.
Under these conditions, the average
processing time is less than 0.5 seconds per
translation. Although the processing time
increases as the corpus size increases, the growth
is not linear but roughly proportional to the
square root of the corpus size. Compared to
DP-matching [2], retrieving examples with a word
graph and A* search achieves more efficient
processing. When using a threshold of 0.2 with
random sentences, the processing time decreases
significantly, but the translation quality is low.
The reason is that the bilingual corpus we used is
too small; as a result, the retrieved examples are
not similar enough to the input sentence.
There are two ways to increase the translation
quality. First, we can enlarge the example set.
Second, since we do not yet have an appropriate
way to choose the right meaning from the
bilingual dictionary, we can apply context-based
translation to the EBMT system.
Tables 3 and 4 summarize the evaluation
results.
Table 3: Set of edited sentences and performance

Rank                          A     B     C     D
Total                         25    11    3     11
Average length of sentence    9.3   6.3   7.8   8.4

Precision: 70%
Table 4: Set of random sentences and performance

Rank                          A     B     C     D
Total                         15    10    3     22
Average length of sentence    5.7   5.6   6.0   8

Precision: 50%
The system can translate sentences with complex
grammatical structures as long as the retrieved
example is similar enough to the input sentence.
7. Conclusion:
We have reported a retrieval method for an
EBMT system using edit distance and an
evaluation of its performance using our corpus. In
the performance evaluation experiments, we used
a bilingual corpus comprising 6,000 sentence
pairs from various fields. The main reason for
some low-quality translations is the small size of
the bilingual corpus. The EBMT system will
perform better when it is applied to a specific
field, for example translating product manuals or
introductions in the travel domain. The
experimental results show that the EBMT system
achieves feasible translation ability and also
efficient processing with the proposed retrieval
method.
Acknowledgements:
The authors’ heartfelt thanks go to Professor
Thu Huong Nguyen, Computer Science
Department, Hanoi University of Technology, for
supporting the project, and to the Socbay
linguistic specialists for providing resources and
helping us to test the system.
References
[1] Takao Doi, Hirofumi Yamamoto and Eiichiro
Sumita. 2005. Graph-based Retrieval for
Example-based Machine Translation Using
Edit-distance.
[2] Eiichiro Sumita, 2001. Example-based
machine translation using DP-matching
between word sequences.
[3] John Edward Hopcroft and Jeffrey Ullman,
1979. Introduction to Automata Theory,
Languages and Computation. Addison-Wesley,
Reading, MA.
[4] Janusz Antoni Brzozowski, Canonical regular
expressions and minimal state graphs for
definite events, Mathematical Theory of
Automata, 1962, MRI Symposia Series,
Polytechnic Press, Polytechnic Institute of
Brooklyn, NY, 12, 529–561.
[5] Steven S Ngai and Randy B Gullett, 2002.
Example-Based Machine Translation: An
Investigation.
[6] Ralf Brown, 1996. Example-Based Machine
Translation in the PanGloss System. In
Proceedings of the Sixteenth International
Conference on Computational Linguistics,
pages 169-174, Copenhagen, Denmark.
[7] Michael Carl, 1999. Inducing Translation
Templates for Example-Based Machine
Translation. In Proceedings of MT-Summit
VII, Singapore.
[8] Đinh Điền, 2002. Building a training corpus
for word sense disambiguation in English-to-
Vietnamese Machine Translation.
[9] Chunyu Kit, Haihua Pan and Jonathan J.
Webster, 1994. Example-Based Machine
Translation: A New Paradigm.
[10] Kenji Imamura, Hideo Okuma, Taro
Watanabe, and Eiichiro Sumita, 2004.
Example-based Machine Translation Based on
Syntactic Transfer with Statistical Models.