Hybrid Filipino-English Machine Translation System

R. O. Roxas* and G. A. Fontanilla†
Software Technology Department, De La Salle University, 2401 Taft Avenue, Manila, Philippines

Approaches to machine translation have generally been classified as either knowledge-based or corpus-based. A knowledge-based approach relies on expert knowledge captured in rules, while a corpus-based approach infers translation knowledge from example translations. It has been evident from the inception of machine translation to the present day that a single approach, whether knowledge-based or corpus-based, is not enough to fulfill the needs of a modern machine translation system. To address this, an amalgamation of approaches has been proposed. A hybrid approach is presented in an attempt to exploit the advantages of the rule-based approach (a kind of knowledge-based approach) and the example-based approach (a kind of corpus-based approach). The discussion focuses on the learning aspect of the hybrid system, where transfer rules and unification constraints are derived.

1. INTRODUCTION

1.1. Overview of MT

In this age of information and globalization, the need to cross language barriers in order to distribute information about products, services, or entertainment has never been greater. The demand for machine translation systems has reached the point where the volume of material to be translated is already too much for human translators to handle [1]. Information such as news reports needs to be translated into different languages on a daily basis, so the need for machine translation tools has become evident. However, despite more than fifty years of existence, machine translation research is still pursuing its ultimate goal, which is to achieve the best translation quality. It has been argued that a single MT paradigm cannot fulfill this goal [2].
This has led to the integration of paradigms into a single system.

1.2. MT Paradigms

1.2.1. Knowledge-based approach

The knowledge-based approach is characterized by the encapsulation of expert knowledge into rules and knowledge bases. The approach is tedious because the knowledge bases are traditionally built by hand. However, it yields high-quality translation in limited domains because of the linguistic knowledge captured from linguists [3]. At the same time, it is difficult to maintain and extend. Rule-based approaches generally fall under this category.

*Electronic address: roxasr@dlsu.edu.ph
†Electronic address: i_am_ife@yahoo.com

1.2.2. Corpus-based approach

The corpus-based approach is the opposite of the knowledge-based approach. Here, no expert rules are gathered; instead, translation knowledge is automatically learned from a training set of translation examples, or corpus. This makes the approach easy to extend to different domains; however, the quality of translation is highly dependent on the training set [2]. Example-based and statistical machine translation systems fall under this category.

1.2.3. Multi-paradigm approach

This approach combines the existing paradigms into a single system. The amalgamation of paradigms aims to maximize the advantages of the different approaches and minimize their disadvantages. A multi-paradigm approach can be either multi-engine or hybrid. In a multi-engine MT system, several engines are employed in either a parallel or sequential manner, and the engines are independent of each other. Their outputs are consolidated by a selection or combination algorithm. A hybrid MT system focuses on the interactions of the different paradigms. The translation phases may use different paradigms and communicate with each other to produce a single output. Components and resources may be shared by the different paradigms, making the hybrid system more robust [2].

2. RELATED SYSTEMS

2.1. LEFT

LEFT is a bidirectional, rule-based MT system. It uses lexical-functional grammar (LFG) as its formalism for analyzing and translating English and Filipino sentences. LFG uses constituent structures (c-structures) and functional structures (f-structures) to capture the syntactic and functional (semantic) forms of a given sentence [4].

LEFT accepts either an English or a Filipino input sentence and parses it until an f-structure representation of the sentence is determined. The f-structure of the source language sentence is translated into the f-structure of the target language using manually built transfer rules. The target language f-structure is then converted back into sentence form.

In theory, LEFT's strength lies in its ability to analyze a sentence, because it uses both constituent structure and functional structure representations. Its generation module also accesses its grammar rules, which in theory ensures that the generated output is grammatically correct. However, LEFT's evident weakness is its transfer rules. They were built manually by the proponents and cover only a limited range of sentence types.

2.2. TExtual Translation

Textual Translation is an example-based system that uses translation templates for bidirectional English and Filipino translation. The templates are extracted from aligned examples. They are created by generalizing two or more similar sentences: the differences between the sentences are replaced with variables, and recurring sentence elements remain as constants. Templates can be further generalized through training [5].

Textual Translation's strength lies in its flexibility. However, since it uses templates, it lacks the formalism of grammatical and transfer rules. Therefore, in theory, the output may or may not be grammatically sound, especially if the training set contains grammatical errors.

3. THE HYBRID MT ARCHITECTURE

The hybrid MT system is built upon two existing systems, LEFT and Textual Translation. It combines the engines of both, and Textual Translation's template learning module is augmented with a transfer rule learning module for the rule-based engine. The hybrid system can choose which engine to use depending on the input sentence.

3.1. TRANSFER RULE LEARNING

The learning module of the hybrid system has two main functions: 1) to learn transfer rules and 2) to learn translation templates. Both are learned from the same training set. A training set is a set of examples, that is, matching English and Filipino sentences which are translations of each other. In this section, only the learning of transfer rules is discussed; the template learning is adapted from Textual Translation.

The two sentences of an example are first aligned using unit alignment. Unit alignment matches a word in one sentence with its one-word translation in the matching sentence of the other language. During training, it does not matter which language is designated as source or target, since translation is intended to be bidirectional. If a word does not have a one-word translation, its alignment is null. The unit alignment module is adapted from Textual Translation. Table I shows an example of unit alignment.

TABLE I: Unit alignment for the example “Si Marlon ay naglalaro. = Marlon is playing.”

Filipino word   English word   Alignment #
Si              Null           1
Marlon          Marlon         2
ay              is             3
naglalaro       playing        4

After alignment, each sentence of the matching pair is parsed, and a c-structure is generated for each. The c-structures can also be thought of as parse trees. Using the results of unit alignment, the transfer rules can now be derived. Figure 1 shows the c-structures for the same sentence pair.

Filipino
FS ROOT
  FS ay: ay
  FS SUBJ
    FS nom: si
    FS HEAD
      FS proper: Marlon
  FS verb: naglalaro

English
FS ROOT
  FS aux: is
  FS SUBJ
    FS HEAD
      FS proper_noun: Marlon
  FS verb: playing

Fig. 1: C-structures for the sentence example “Si Marlon ay naglalaro. = Marlon is playing.”

For each given example, a transfer rule set is derived. Two kinds of transfer rules are learned: seed rules and compositional rules. A transfer rule set is the collection of seed and compositional rules generated from a given example.

A seed rule is a one-to-one correspondence between matching elements of the sentence pair in an example. It is called a seed rule because it involves only the terminal symbols of the parse trees and is taken directly from the unit alignment. The seed rules derived are shown in Figure 2.

Fil side = nom      Eng side = null
Fil side = verb     Eng side = verb
Fil side = ay       Eng side = aux
Fil side = proper   Eng side = proper_noun

Fig. 2: Seed rules

A compositional rule is the composition of seed rules or other compositional rules into a larger compositional element. It is similar to a production rule in a context-free grammar: it has a head element on the left side and the production on the right side. Note that not all seed rules need to be derived before the compositional rules. The maximum compositional rule in a transfer rule set is the compositional rule with ROOT as the head element on both sides. Figure 3 shows the derived compositional rules.

Fil side = HEAD : proper         Eng side = HEAD : proper_noun
Fil side = SUBJ : nom HEAD       Eng side = SUBJ : HEAD
Fil side = ROOT : ay SUBJ verb   Eng side = ROOT : aux SUBJ verb

Fig. 3: Compositional rules

In deriving the compositional rules, maximum compositionality is maintained, since all productions are provided by the parse tree (c-structure) representation. This means that the topmost rule, or maximum compositional rule, is Sentence <-> Sentence, or in this case ROOT : ay SUBJ verb <-> ROOT : aux SUBJ verb.
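The derivation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual implementation: the tuple encoding of the c-structures, the function names (terminals, productions, seed_rules, compositional_rules), and the pairing of non-terminals across the two trees by identical label are all hypothetical choices made for this example.

```python
# A c-structure node is a tuple (hypothetical encoding):
#   terminal:     (category, word)
#   non-terminal: (label, [child nodes])
FIL_TREE = ("ROOT", [
    ("ay", "ay"),
    ("SUBJ", [("nom", "Si"), ("HEAD", [("proper", "Marlon")])]),
    ("verb", "naglalaro"),
])
ENG_TREE = ("ROOT", [
    ("aux", "is"),
    ("SUBJ", [("HEAD", [("proper_noun", "Marlon")])]),
    ("verb", "playing"),
])
# Unit alignment from Table I; None stands for a null alignment.
ALIGNMENT = [("Si", None), ("Marlon", "Marlon"),
             ("ay", "is"), ("naglalaro", "playing")]


def terminals(tree):
    """Yield (category, word) for every leaf, left to right."""
    label, body = tree
    if isinstance(body, str):
        yield label, body
    else:
        for child in body:
            yield from terminals(child)


def productions(tree):
    """Map each non-terminal label to the list of its children's labels."""
    label, body = tree
    if isinstance(body, str):
        return {}
    rules = {label: [child[0] for child in body]}
    for child in body:
        rules.update(productions(child))
    return rules


def seed_rules(fil_tree, eng_tree, alignment):
    """Pair the categories of aligned words (assumes each word occurs once)."""
    fil_cat = {word: cat for cat, word in terminals(fil_tree)}
    eng_cat = {word: cat for cat, word in terminals(eng_tree)}
    return [(fil_cat[f], eng_cat[e] if e is not None else "null")
            for f, e in alignment]


def compositional_rules(fil_tree, eng_tree):
    """Pair productions whose head label appears in both trees (assumption)."""
    fil, eng = productions(fil_tree), productions(eng_tree)
    return {head: (fil[head], eng[head]) for head in fil if head in eng}
```

Calling seed_rules(FIL_TREE, ENG_TREE, ALIGNMENT) yields the four category pairs of Figure 2, and compositional_rules(FIL_TREE, ENG_TREE) yields the three rules of Figure 3, with the ROOT entry being the maximum compositional rule.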
In order to address semantics, unification constraints are used. Unification constraints are useful for handling feature passing from the source to the target language. An important feature is tense (e.g., past, present, future). Constraints also handle determiner dropping and subject-verb agreement in English-Filipino and Filipino-English translation. Constraints are introduced in the transfer rules, both seed and compositional. A constraint indicates which feature should be maintained. An example of a unification constraint is shown in Figure 4.

Fil side = ay   Eng side = aux   (F1 NUM: any = E1 NUM: sing)

Fig. 4: Unification constraint

This unification constraint indicates that the English auxiliary verb has a NUM feature with the value sing, meaning that it is singular. The Filipino ay, on the other hand, is neutral with respect to number.

3.2. TRANSLATION MODULE

During translation, the hybrid system accepts an input (source language) sentence in either English or Filipino. The source sentence is parsed and a constituent structure (parse tree) is generated. If the sentence is parsed successfully, meaning that it is accepted by the grammar, the rule-based engine is used for translation. If the source sentence cannot be parsed, the example-based engine is used. The example-based engine is also used if the sentence is parsed successfully but there are insufficient transfer rules for it.

The candidate transfer rule sets are filtered from the entire transfer rule database by their maximum compositional rule (the rule which contains the ROOT structure). The topmost production of the parse tree is compared against the maximum compositional rules of all the transfer rule sets. For example, if the topmost production of the parsed source sentence is ROOT : ay SUBJ verb, then all candidates will have the same production on one side of their maximum compositional rules.
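The filtering step just described can be sketched as follows. This is only an illustration under assumptions: the dictionary layout of a rule set, the names RULE_SET_A, RULE_SET_B, RULE_DB and candidate_rule_sets, and the second rule set itself are hypothetical and invented for the example; only RULE_SET_A mirrors the rules of Figure 3.

```python
# A transfer rule set maps each head label to its production on the
# Filipino ("fil") and English ("eng") sides (hypothetical layout).
RULE_SET_A = {  # mirrors Figure 3
    "ROOT": {"fil": ["ay", "SUBJ", "verb"], "eng": ["aux", "SUBJ", "verb"]},
    "SUBJ": {"fil": ["nom", "HEAD"], "eng": ["HEAD"]},
    "HEAD": {"fil": ["proper"], "eng": ["proper_noun"]},
}
RULE_SET_B = {  # invented purely to give the filter something to reject
    "ROOT": {"fil": ["verb", "SUBJ"], "eng": ["SUBJ", "verb"]},
    "SUBJ": {"fil": ["nom", "HEAD"], "eng": ["HEAD"]},
    "HEAD": {"fil": ["proper"], "eng": ["proper_noun"]},
}
RULE_DB = [RULE_SET_A, RULE_SET_B]


def candidate_rule_sets(top_production, rule_db, side):
    """Keep the rule sets whose maximum compositional rule (the ROOT rule)
    has `top_production` on the source side ("fil" or "eng")."""
    return [rule_set for rule_set in rule_db
            if rule_set["ROOT"][side] == top_production]


# Filipino input whose parse tree has the topmost production ROOT : ay SUBJ verb
filipino_candidates = candidate_rule_sets(["ay", "SUBJ", "verb"], RULE_DB, "fil")
```

Because the rules are bidirectional, the same function serves English input by passing side="eng" and the topmost production of the English parse tree.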
All productions from the source language parse tree must be present in a transfer rule set for that rule set to be considered appropriate. If no transfer rule set satisfies this condition, other transfer rule sets are searched for the missing productions. The rule set with the highest number of matching productions, plus the additional rules taken from other rule sets, is then used for translation. However, if some productions are not covered by the entire transfer rule database, then there is no appropriate transfer rule set for the input sentence and the example-based engine is used.

When the example-based engine is used, the source language sentence is matched with the most similar template. Translation then proceeds with that template, and the variables in the template are translated accordingly.

4. DISCUSSION

The hybrid system is currently in its development stage. Most of the components were adapted from LEFT and Textual Translation, since both systems had completed components which were already available. Although the system is yet to be completed, it already shows promising results, in that it is able to capture linguistic phenomena that are not handled by LEFT. These include the subject-verb agreement constraint and determiner dropping, which LEFT was unable to address.

Two engines are used for robustness. The rule-based engine is intended to be used as much as possible, whereas the example-based engine serves as its backup. The example-based engine is also intended to handle idiomatic expressions, which are generally not grammatically correct. Its frequency of use, however, remains to be tested.

[1] J. Hutchins, Machine Translation and Computer-Based Translation Tools: What’s Available and How It’s Used, A New Spectrum of Translation Studies, pp. 13-48 (University of Valladolid, Valladolid, 2004).
[2] R. Roxas, Towards a hybrid machine translation system for Philippine languages, Henry Sy Professorial chair lecture (De La Salle University Manila, 2003).
[3] D. Alcantara, B. Hong, A. Perez and L. Tan, Rule Extraction Applied in Language Translation – R.E.A.L. Translation, Undergraduate Thesis (De La Salle University Manila, 2006).
[4] E. Chan, C. Lim, R. Tan and M. Tong, LEFT: Lexical Functional Grammar Based English Filipino Translator, Undergraduate Thesis (De La Salle University Manila, 2006).
[5] K. Go, M. Morga, V. Nunez and F. Veto, TExt Translation: Template Extraction for a Bidirectional English-Filipino Example-Based Machine Translation, Undergraduate Thesis (De La Salle University Manila, 2006).