Using Example-based Machine Translation for English-Vietnamese Translation

Nguyen Minh Quang, Tran Dang Hung
Software Engineering Department, Faculty of Information Technology
Hanoi National University of Education
{quangnm, hungtd}@hnue.edu.vn

Abstract

Recently, there have been significant advances in machine translation in Vietnam. Most approaches combine grammar analysis with either a statistics-based or a rule-based method. However, their results are still far from what human users expect. In this paper, we introduce a new approach based on example-based machine translation. The idea is to use a set of aligned Vietnamese-English sentence pairs together with an algorithm that retrieves, from this data resource, the English sentence most similar to the input sentence. We then build the translation from the retrieved sentence. We applied the method to English-Vietnamese translation using a bilingual corpus of 6000 sentence pairs. The system shows feasible translation ability and good performance. Compared to other methods applied to English-Vietnamese translation, our method can achieve higher translation quality.

I. Introduction

Machine translation has been studied and developed for many decades. For Vietnamese, several projects have proposed different approaches. Most of them use a system based on analyzing and reflecting grammatical structure (e.g. rule-based and corpora-based approaches). Among them, the rule-based approach is currently the dominant direction in this field, relying on carefully built bilingual corpora and grammatical rules [7]. One of the biggest difficulties in rule-based translation, as in other methods, is data resources. An important resource required for translation is the thesaurus, which takes a great deal of effort to build [9]. This resource, however, does not yet meet users' requirements.
In addition, most traditional methods also require knowledge of the languages involved, so it takes time to build a system for a new language [5, 6]. Example-Based Machine Translation (EBMT) is a new method, which relies on large corpora and tries to some extent to reject traditional linguistic notions [5]. EBMT systems are attractive in that their output translations should be more sensitive to context than those of rule-based systems, i.e. of higher quality in appropriateness and idiomaticity. Moreover, they require a minimum of prior knowledge beyond the corpora that make up the example set, and can therefore be adapted quickly to many language pairs [5].

EBMT has been applied successfully in Japan and America in some specific fields [1]. For Japanese, a system has been built that achieves high-quality translation and efficient processing for travel expressions [1]. For Vietnamese, however, to our knowledge no research has followed this method, even though applying it to English-Vietnamese translation does not require many resources or much linguistic knowledge. The only available resource is the English-Vietnamese corpus of Ho Chi Minh National University, a significant data resource with 40,000 sentence pairs (in Vietnamese and English) and about 5,500,000 words [8]. We already have an English thesaurus and an English-Vietnamese dictionary. For the set of aligned corpora, we have prepared 5,500 items for this research.

In this paper, we use EBMT to build a system for English-Vietnamese translation. We apply the graph-based method [1] to the Vietnamese language. In this paradigm, we have a set in which each item is a pair of sentences: one in the source language and one in the target language. Given an input sentence, we retrieve from the set the item whose source sentence is most similar to the input. Finally, from the example and the input sentence, we make adjustments to produce the final sentence in the target language.
Unfortunately, we do not have a Vietnamese thesaurus, so we propose some solutions to this problem. In addition, this paper proposes a method to adapt the example sentence to produce the final translation.

1. EBMT overview:

There are 3 components in a conventional example-based system:
- Matching of fragments.
- Word alignment.
- Combination of the input and the retrieved example sentence to produce the final target sentence.

For example:
(1) He buys a book on international politics
(2) a. He buys a notebook.
    b. Anh ấy mua một quyển sách.
(3) Anh ấy mua một quyển sách về chính trị thế giới.

With the input sentence (1), the translation (3) can be produced by matching and adapting from (2a, b).

One of the main advantages of this method is that we can easily improve translation quality by enlarging the example set: the more items are added, the better the results. The method is particularly useful in a specific field, because the forms of sentences in such a field are limited. For example, we can use it to translate product manuals, weather forecasts, or medical diagnoses.

The difficulty in applying EBMT to Vietnamese is that there is no WordNet for Vietnamese, so we propose some new solutions to this problem. We build a system with 3 steps:
- Form the set of example sentences; the result is a set of graphs.
- Retrieve the example sentence most similar to the input sentence. Given an input sentence, the system uses an edit-distance measure to find the sentences that are most similar to it. Edit distance is used for fast approximate matching between sentences: the smaller the distance, the greater the similarity between the sentences.
- Adjust the gap between the example and the input.

2. Data resource:

We use 3 data resources:

Bilingual corpus: this is the set of example sentences. The set consists of pairs of sentences. Each sentence is represented as a word sequence. Enlarging the set will improve the quality of translation.
The Thesaurus: A list of words showing similarities, differences, dependencies, and other relationships to each other.

Bilingual Dictionary: We used the popular English-Vietnamese dictionary file provided by the Socbay company.

3. Build the graph of example set.

The sentences are word sequences. We divide the words into 2 groups:
- Function word: Function words (or grammatical words or synsemantic words) are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker.
- Content word: Words that are not function words are called content words (or open class words or lexical words): these include nouns, verbs, adjectives, and most adverbs, although some adverbs are function words (e.g., then and why).

We partition the example set into subsets. Each subset contains sentences with the same number of content words and the same number of function words. Based on this division, we build a group of graphs, called word graphs:
- They are directed graphs with a start node and a goal node. They consist of nodes and edges, and each edge is labeled with a word. In addition, each edge has its own source node and destination node.
- Each graph represents one subset. Each subset contains sentences with the same total number of content words and the same total number of function words.
- Each path from the start node to the goal node represents a candidate sentence.

To optimize the system, we have to minimize the size of the word graph. Common word sequences in different sentences share the same edges.

Figure 1: Example of Word Graph

The word graphs have to be optimized to have the minimum number of nodes. We use the method of converting finite-state automata [3, 4].

After preparing all resources for this method, we execute its 2 steps: example retrieval and adaptation.

4. Example retrieval:

We use the A* search algorithm to retrieve the most similar sentences from the word graph.
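To make the structure searched here more concrete, the word graph of Section 3 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the system's implementation: the function-word list `FUNCTION_WORDS`, the class `WordGraph`, and the helpers `subset_key` and `build_graphs` are hypothetical names; only common prefixes are shared (the full automaton minimization of [3, 4] is not performed); and a set of goal nodes stands in for the single goal node.

```python
from collections import defaultdict

# Hypothetical, tiny function-word list used only for this illustration.
FUNCTION_WORDS = {"a", "an", "the", "he", "she", "we", "this", "by", "on"}


def subset_key(words):
    """Return (number of content words, number of function words)."""
    n_function = sum(w.lower() in FUNCTION_WORDS for w in words)
    return (len(words) - n_function, n_function)


class WordGraph:
    """Directed graph whose edges are labeled with words.

    Simplification: sentences share edges only for common prefixes,
    and several goal nodes are allowed instead of a single one.
    """

    def __init__(self):
        self.edges = {0: {}}  # node id -> {word: destination node id}
        self.goal = set()     # nodes at which a sentence ends

    def add(self, words):
        node = 0  # node 0 is the start node
        for w in words:
            nxt = self.edges[node].get(w)
            if nxt is None:  # no shared edge yet: create a new node
                nxt = len(self.edges)
                self.edges[nxt] = {}
                self.edges[node][w] = nxt
            node = nxt
        self.goal.add(node)

    def paths(self, node=0, prefix=()):
        """Yield every candidate sentence (path from start to a goal node)."""
        if node in self.goal:
            yield list(prefix)
        for w, nxt in self.edges[node].items():
            yield from self.paths(nxt, prefix + (w,))


def build_graphs(sentences):
    """Build one word graph per (content count, function count) subset."""
    graphs = defaultdict(WordGraph)
    for s in sentences:
        words = s.split()
        graphs[subset_key(words)].add(words)
    return graphs


graphs = build_graphs(["He buys a notebook", "He buys a book"])
for key, graph in graphs.items():
    print(key, list(graph.paths()))
```

In this sketch the two sentences fall into the same subset and share the edges for "He buys a", which is the effect the edge sharing described in Section 3 is meant to achieve.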
The result of matching two word sequences is a sequence of substitutions, deletions and insertions. The search process in a word graph finds the least distance between the input sentence and all the candidates represented in the graph. As a result, the matching sequence of a path is obtained as a list of records, each consisting of a label and one or two words:

Exact match: E(word)
Substitution: S(word, word)
Deletion: D(word)
Insertion: I(word)

For example, the matching sequence between the input sentence We close the door and the example She closes the teary eyes is: S(“She”, “We”) – E(“close”) – E(“the”) – D(“teary”) – S(“eyes”, “door”).

The problem here is that we have to pick the sentence with the least distance to the input sentence. We first compare the total number of E-records in each matching sequence, then the number of S-records, and so on.

5. Adaptation:

From the retrieved example, we adapt it by insertions, substitutions and deletions to produce the final target-language sentence for the input sentence. To find the meaning of English words, we use a morphological method.

5.1. Substitution, deletion and exact match:

We find the right position for the word in the final sentence for substitution, deletion and exact match. With deletion, we do nothing; for the other records, the problem is that we have to find the meaning of the word:
- A word can have several different meanings; which one should be chosen?
- Words in the dictionary are all in base (infinitive) form, while they can appear in many other forms in the input sentence.

We address these problems as follows. Firstly, we find the type of each word (noun, verb, adverb, …) in the sentence. We use the Penn Treebank tagging system to determine the form of each word. Secondly, based on the form of the word, we look it up in the dictionary:

If the word is plural (NNS):
- If it ends with “CHES”, we try deleting “ES” and “CHES”; when the deletion yields a base form, we look up its meaning in the dictionary. Otherwise, it is treated as a specific noun.
- If it ends with “XES” or “OES”, we delete “ES” and find the meaning.
- If it ends with “IES”: replace “IES” by “Y”.
- If it ends with “VES”: replace “VES” by “F” or “FE”.
- If it ends with “ES”: replace “ES” by “S”.
- If it ends with “S”: delete “S”.
After finding the meaning of the plural, we add “những” before its meaning.

If the word is a gerund:
- Delete “ING” at the end of the word. We try two cases: the first is the word without “ING”, and the second is the word without “ING” and with “IE” at the end.

If the word is a third-person singular present verb (VBZ):
- If the word is “IS”: it is “TO BE”.
- If it ends with “IES”: replace “IES” by “Y”.
- If it ends with “SSES”: erase “ES”.
- If it ends with “S”: erase “S”.

If the word is in the past participle or past form:
- Check whether the word is included in the list of irregular verbs. If it is, we use the infinitive form to find the meaning. The list of irregular verbs is stored in a red-black tree to make the search easier and faster.
- If it ends with “IED”: replace “IED” by “Y”.
- If it ends with “ED”: check the last 2 letters before “ED”; if they are identical, we erase the last 3 letters of the word. Otherwise, we erase “ED”.

If the word is in the present continuous form, we look it up in the same way as a gerund. After that, we add “đang” before the meaning.

If the word is JJS: we try deleting the last 3 and the last 4 letters, and find the meaning in the dictionary.

After the infinitive form of the word is found, we use the bilingual dictionary to look up the meaning. The problem is that, once we reach the infinitive form, a word can still have many meanings, and we have to choose the right one. In our experiment, we take the first meaning in the bilingual dictionary.

5.2. Insertion:

The problem here is that we do not know the exact position at which to place the Vietnamese meaning. If we simply use the position of the Insertion record in the matching sequence, the final sentence in Vietnamese will be of low quality. We have to use the theory of rule-based machine translation to solve this problem. We can use it for some specific phrases to find a better position than the order of the records.

Firstly, the link grammar system parses the grammatical structure of the sentence. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

               +----O----+
     +-D-+--S--+    +--D-+
     |   |     |    |    |
    The boy painted a picture

From the grammatical structure of the sentence, we identify phrases in English whose word order must change when translated into Vietnamese. For example, the noun phrase “nice book”, with 2 I-records I(nice) and I(book), would otherwise be translated as “hay quyển sách” instead of “quyển sách hay”. With link grammar, we know the exact order in which to translate. Some phrases to process:

Table 1: Some phrases to process with Link Grammar
1. Noun phrase: POS(1, 2) = ({JJ}, {NN}). Reorder: ({NN}, {JJ})
2. Noun phrase: POS(1, 2, 3) = ({DT}, {JJ}, {NN}) && word1 = this, that, these, those. Reorder: ({NN}, {JJ}, {DT})
3. Noun phrase: POS(1, 2) = ({NN1}, {NN2}). Reorder: ({NN2}, {NN1})
4. Noun phrase: POS(1, 2) = ({PRP$}, {NN}). Reorder: ({NN}, {PRP$})
5. Noun phrase: POS(1, 2, 3) = ({JJ1}, {JJ2}, {NN}). Reorder: ({NN}, {JJ2}, {JJ1})

5.3. Example:

Input sentence: This nice book has been bought

Example retrieval: the most similar example found for the input sentence is This computer has been bought by him.

Sequence of records: E(“This”) – I(“nice”) – S(“computer”, “book”) – E(“has”) – E(“been”) – E(“bought”) – D(“by”) – D(“him”).
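The record sequence of this example can be reproduced with a plain word-level alignment. The following is a minimal sketch, not the retrieval procedure of the system itself: it assumes unit costs for substitution, deletion and insertion (the ranking by E-records and S-records described in Section 4 is not modeled), it aligns one example directly instead of searching a word graph with A*, and the function name `match_records` is hypothetical.

```python
def match_records(example, input_words):
    """Align an example sentence with an input sentence (lists of words).

    Returns (distance, records), where each record is one of
    ("E", word), ("S", example_word, input_word),
    ("D", example_word) or ("I", input_word).
    Assumption: substitution, deletion and insertion all cost 1.
    """
    n, m = len(example), len(input_words)
    # dist[i][j] = edit distance between example[:i] and input_words[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if example[i - 1] == input_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # exact match / substitution
                             dist[i - 1][j] + 1,        # deletion from the example
                             dist[i][j - 1] + 1)        # insertion from the input

    # Trace back from the end to recover one optimal sequence of records.
    records = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and example[i - 1] == input_words[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            records.append(("E", example[i - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            records.append(("S", example[i - 1], input_words[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            records.append(("D", example[i - 1]))
            i -= 1
        else:
            records.append(("I", input_words[j - 1]))
            j -= 1
    records.reverse()
    return dist[n][m], records


distance, records = match_records(
    "This computer has been bought by him".split(),
    "This nice book has been bought".split(),
)
print(distance, records)
```

Under these assumptions the alignment contains four E-records, one S-record, one I-record and two D-records, in the same order as the sequence listed above.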
With link grammar, we find a noun phrase in the sentence, “This nice book”, corresponding to the records E(“This”), I(“nice”), S(“computer”, “book”) respectively. We reorder the sequence: S(“computer”, “book”) – I(“nice”) – E(“This”) – E(“has”) – E(“been”) – E(“bought”) – D(“by”) – D(“him”).

Based on the new record sequence and the example, the adaptation phase proceeds as follows:
- Exact match: Keep the order and the meaning of the word.
  “” – “” – “” – “này” – “được” – “mua” – “” – “” – “”
- Substitution: Find the meaning of the word in the input sentence and replace the word in the example by it.
  “Quyển sách” – “” – “” – “này” – “được” – “mua” – “” – “” – “”
- Deletion: Just erase the word in the example.
  “Quyển sách” – “” – “” – “này” – “được” – “mua” – “” – “” – “”
- Insertion: We now have the right order of records, so we just find the meaning of the word in the Insertion record and place it according to the order of the records in the sequence.
  “Quyển sách” – “hay” – “” – “này” – “được” – “mua” – “” – “” – “”

After these 4 steps of adaptation, we obtain the final sentence: “Quyển sách hay này được mua”

6. Evaluation:

6.1. Experimental Condition:

We manually built an English-Vietnamese corpus of 6000 sentence pairs. To evaluate translation quality, we employed a subjective measure. Each translation result was graded into one of four ranks by a bilingual human translator who is a native speaker of Vietnamese. The four ranks were:
A: Perfect. No problem with either grammar or information. Translation quality is nearly equal to that of a human translator.
B: Fair. The translation is easy to understand but contains some grammatical mistakes or is missing some trivial information.
C: Acceptable. The translation is broken but can be understood with effort.
D: Nonsense. Important information was translated incorrectly.

The English-Vietnamese dictionary used includes 70,000 words. To optimize the processing time, a threshold is used to limit the result set of the example retrieval phase.
Table 2 shows the thresholds we used to optimize the example retrieval phase for sentences shorter than 30 words. If the length of the input sentence is greater than or equal to 30, the threshold is 8.

Table 2: Value of threshold
Length of sentence (words):   0-5    5-10    10-15    15-30
Threshold:                    2      3       4        6

6.2. Performance:

For the experiment, we created two test sets: a set of random sentences with simple grammatical structure and a set of 50 sentences edited from the training set. Under these conditions, the average processing time is less than 0.5 second per translation. Although the processing time increases as the corpus size increases, the growth is not linear but about the half power of the corpus size. Compared to DP-matching [2], retrieving examples with a word graph and A* search achieves efficient processing.

Using the threshold 0.2 with random sentences, the processing time decreases significantly, but the translation quality is low. The reason is that the bilingual corpus we used is too small. As a result, the retrieved examples are not similar enough to the input sentence. There are two ways to increase translation quality. Firstly, we can enlarge the example set. Secondly, since we do not yet have an appropriate way to choose the right meaning from the bilingual dictionary, we can apply context-based translation to the EBMT system.

Table 3 and Table 4 show the evaluation results.

Table 3: Set of edited sentences and performance
Rank:                         A      B      C      D
Total:                        25     11     3      11
Average length of sentence:   9.3    6.3    7.8    8.4
Precision: 70%

Table 4: Set of random sentences and performance
Rank:                         A      B      C      D
Total:                        15     10     3      22
Average length of sentence:   5.7    5.6    6.0    8
Precision: 50%

The system can translate sentences with complex grammatical structure as long as the retrieved example is similar enough to the input sentence.

7. Conclusion:

We have reported on a retrieval method for an EBMT system using edit distance, and on the evaluation of its performance using our corpus. In the experiments for performance evaluation, we used a bilingual corpus comprising 6000 sentences from every field. The main cause of the low-quality translations is the small size of the bilingual corpus. The EBMT system will perform better when it is applied to a specific field. For example, we can use EBMT to translate product manuals, or introductions in the travel field. Experimental results show that the EBMT system achieved feasible translation ability, and also achieved efficient processing by using the proposed retrieval method.

Acknowledgements:

The authors' heartfelt thanks go to Professor Thu Huong Nguyen, Computer Science Department, Hanoi University of Technology, for supporting the project, and to the Socbay linguistic specialists for providing resources and helping us to test the system.

References

[1] Takao Doi, Hirofumi Yamamoto and Eiichiro Sumita. 2005. Graph-based Retrieval for Example-based Machine Translation Using Edit-distance.
[2] Eiichiro Sumita. 2001. Example-based machine translation using DP-matching between word sequences.
[3] John Edward Hopcroft and Jeffrey Ullman. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
[4] Janusz Antoni Brzozowski. 1962. Canonical regular expressions and minimal state graphs for definite events. Mathematical Theory of Automata, MRI Symposia Series, Polytechnic Press, Polytechnic Institute of Brooklyn, NY, 12, 529–561.
[5] Steven S Ngai and Randy B Gullett. 2002. Example-Based Machine Translation: An Investigation.
[6] Ralf Brown. 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, pages 169-174, Copenhagen, Denmark.
[7] Michael Carl. 1999. Inducing Translation Templates for Example-Based Machine Translation. In Proceedings of MT-Summit VII, Singapore.
[8] Đinh Điền. 2002. Building a training corpus for word sense disambiguation in English-to-Vietnamese Machine Translation.
[9] Chunyu Kit, Haihua Pan and Jonathan J. Webster. 1994. Example-Based Machine Translation: A New Paradigm.
[10] Kenji Imamura, Hideo Okuma, Taro Watanabe, and Eiichiro Sumita. 2004. Example-based Machine Translation Based on Syntactic Transfer with Statistical Models.