Hybrid Filipino-English Machine Translation System
R. O. Roxas* and G. A. Fontanilla†
Software Technology Department, De La Salle University,
2401 Taft Avenue, Manila, Philippines
Approaches to machine translation have generally been classified as either
knowledge-based or corpus-based. A knowledge-based approach involves expert knowledge captured in
rules, while a corpus-based approach infers translation knowledge from example translations. It has been
evident from the inception of machine translation to the present day that a single approach, whether
knowledge-based or corpus-based, is not enough to fulfill the needs of a modern-day machine translation
system. To address this limitation, an amalgamation of approaches has been proposed. A hybrid
approach is presented in an attempt to exploit the advantages of the rule-based approach (a kind of
knowledge-based approach) and the example-based approach (a kind of corpus-based approach). The
discussion focuses on the learning aspect of the hybrid system, where transfer rules and unification
constraints are derived.
1. INTRODUCTION
1.1. Overview of MT
In this age of information and globalization, the need to
cross language-barriers in order to distribute information
regarding products, services, or entertainment has never
been greater. The demand for machine translation
systems has reached the point where the volume of
material that needs to be translated is already too much
for human translators to handle [1]. Information such as
news reports needs to be translated into different languages
on a daily basis, so the need for machine translation
tools has become evident.
However, despite more than fifty years of existence,
machine translation research is still pursuing its ultimate
goal, which is to achieve the best translation quality. It
has been argued that a single MT paradigm cannot
fulfill this goal [2]. This has led to the integration of
paradigms into a single system.
1.2. MT Paradigms
1.2.1. Knowledge-based approach
The knowledge-based approach is characterized by the
encapsulation of expert knowledge into rules and
knowledge bases. The approach is tedious because,
traditionally, the knowledge bases are built manually.
However, it yields high-quality translation in limited
domains due to the linguistic knowledge captured from
linguists [3]. At the same time, it is difficult to
maintain and extend. Rule-based approaches generally
fall under this category.
*Electronic address: roxasr@dlsu.edu.ph
†Electronic address: i_am_ife@yahoo.com
1.2.2. Corpus-based approach
The corpus-based approach is the opposite of the
knowledge-based approach. Here, no expert rules are
gathered; instead, translation knowledge is automatically
learned from a training set of translation examples, or
corpus. This makes it easily extendable to different
domains. However, the quality of translation is highly
dependent on the training set [2]. Example-based and
statistical machine translation systems fall under this
category.
1.2.3. Multi-paradigm approach
This approach combines the existing paradigms into a
single system. This amalgamation of paradigms aims to
maximize the advantages of the different approaches
and minimize their disadvantages. A multi-paradigm
approach can be either multi-engine or hybrid.
In a multi-engine MT, several engines are employed in
either a parallel or sequential manner, where the
different engines are independent of each other. The
different outputs are consolidated by a selection or
combination algorithm.
A hybrid MT system focuses on the interactions of the
different paradigms. The translation phases may use
different paradigms and communicate with each other to
produce a single output. Different components and
resources may be shared by the different paradigms,
making the hybrid system more robust [2].
2. RELATED SYSTEMS
2.1. LEFT
LEFT is a bidirectional, rule-based MT system. LEFT
uses lexical-functional grammar (LFG) as its
formalism for analyzing as well as translating English
and Filipino sentences. LFG uses constituent structures
(c-structures) and functional structures (f-structures)
to capture the syntactic and functional (semantic)
forms of a given sentence [4].
LEFT accepts either an English or a Filipino input
sentence and parses it until an f-structure
representation of the input sentence is determined. The
f-structure of this source language sentence is translated
into the f-structure of the target language using manually
built transfer rules. The target language f-structure is
then converted back into sentence form.
In theory, LEFT’s strength lies in its ability to analyze
a sentence, because it uses constituent structure and
functional structure representations. Its generation
module also accesses its grammar rules, which is
advantageous because, theoretically, it ensures that the
generated output is grammatically correct.
However, LEFT’s evident weakness is in its transfer
rules. The transfer rules were built manually by the
proponents and cover only a limited range of sentence
types.
2.2. TExtual Translation
Textual Translation is an example-based system which
makes use of translation templates for bidirectional
English and Filipino translation. The templates are
extracted from aligned examples. A template is created
by generalizing two or more similar sentences: the
differences between the sentences are replaced with
variables, while recurring sentence elements remain
as constants. Templates can be further generalized
through training [5].
Textual Translation’s strength lies in its flexibility.
However, since it uses templates, it lacks the formalism
of grammatical and transfer rules. Therefore, in theory,
the output may not be grammatically sound, especially
if the training set contains grammatical errors.
3. THE HYBRID MT ARCHITECTURE
The hybrid MT system is built upon two existing
systems, LEFT and Textual Translation. It combines the
engines of both, and Textual Translation’s template
learning module is augmented with a transfer rule
learning module for the rule-based engine. The hybrid
system chooses which engine to use depending on the
input sentence.
3.1. TRANSFER RULE LEARNING
The learning module of the hybrid system has two main
functions: 1) to learn transfer rules and 2) to learn
translation templates. Both are learned from the same
training set. A training set is a set of examples, where
each example is a pair of matching English and Filipino
sentences which are translations of each other. In this
section, only the learning of transfer rules is discussed;
the template learning is adapted from Textual Translation.
The two sentences of an example are first aligned using
unit alignment. Unit alignment matches each word in the
source language sentence with its one-word translation in
the matching target language sentence. During training,
it does not matter which language is designated as source
or target, since translation is intended to be
bidirectional. If a word does not have a one-word
translation, then its alignment is null. This unit
alignment module is adapted from Textual Translation.
Table I shows an example of unit alignment.
TABLE I: Unit alignment for the example “Si Marlon
ay naglalaro. = Marlon is playing.”
Filipino word    English word    Alignment #
Si               Null            1
Marlon           Marlon          2
ay               is              3
naglalaro        playing         4
After alignment, the matching sentence pair is parsed,
and a c-structure is generated for each sentence. A
c-structure can also be thought of as a parse tree. Using
the results of unit alignment, the transfer rules can now
be derived. Figure 1 shows the c-structures for the same
sentence pair.
Filipino
FS ROOT
  FS ay: ay
  FS SUBJ
    FS nom: si
    FS HEAD
      FS proper: Marlon
  FS verb: naglalaro
English
FS ROOT
  FS aux: is
  FS SUBJ
    FS HEAD
      FS proper_noun: Marlon
  FS verb: playing
Fig. 1: C-structures for the sentence example “Si
Marlon ay naglalaro. = Marlon is playing.”
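The unit alignment step illustrated in Table I can be sketched in a few lines of Python. This is only a minimal illustration, not the module adapted from Textual Translation: the one-word lexicon `LEXICON` and the function name `unit_align` are hypothetical, since the paper does not describe how word translations are looked up.

```python
# Hypothetical one-word bilingual lexicon (Filipino -> English).
# The paper does not say how one-word translations are found; this
# dictionary is an assumption made for the sake of the example.
LEXICON = {"marlon": "marlon", "ay": "is", "naglalaro": "playing"}


def unit_align(fil_tokens, eng_tokens, lexicon=LEXICON):
    """Pair each Filipino token with its one-word English translation.

    A token with no one-word translation in the English sentence is
    aligned to None (the "null" alignment of Table I).
    """
    remaining = list(eng_tokens)  # English tokens not yet aligned
    alignment = []
    for fil in fil_tokens:
        target = lexicon.get(fil.lower())
        match = next((e for e in remaining if e.lower() == target), None)
        if match is not None:
            remaining.remove(match)  # each English token is used once
        alignment.append((fil, match))
    return alignment


pairs = unit_align(["Si", "Marlon", "ay", "naglalaro"],
                   ["Marlon", "is", "playing"])
for number, (fil, eng) in enumerate(pairs, start=1):
    print(number, fil, eng)
```

Because training is bidirectional, the same function could be called with the roles of the two languages swapped, given a lexicon in the other direction.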
For each given example, a transfer rule set is derived.
There are two kinds of transfer rules learned: seed rules
and compositional rules. A transfer rule set is a collection
of seed and compositional rules generated from that given
example.
A seed rule is basically a one-to-one correspondence
between matching elements of the sentence pair in an
example. It is called a seed rule because it involves only
the terminal symbols in the parse trees, and is actually
taken directly from the unit alignment. The seed rules
derived are shown in Figure 2.
Fil side= nom
Eng side= null
Fil side= verb
Eng side= verb
Fil side= ay
Eng side= aux
Fil side= proper
Eng side= proper_noun
Fig. 2: Seed rules
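As a rough sketch of how seed rules such as those in Fig. 2 could be read off the unit alignment, the following Python pairs the terminal category of each Filipino word with the category of its aligned English word. The function name `seed_rules` and the two tag dictionaries are hypothetical; the categories themselves are copied from the terminals in Fig. 1.

```python
def seed_rules(alignment, fil_tags, eng_tags):
    """Derive (Filipino category, English category) pairs.

    `alignment` is a list of (Filipino word, English word or None).
    A null alignment yields None on the English side.
    """
    rules = []
    for fil_word, eng_word in alignment:
        fil_cat = fil_tags[fil_word]
        eng_cat = eng_tags[eng_word] if eng_word is not None else None
        rule = (fil_cat, eng_cat)
        if rule not in rules:  # keep each seed rule only once
            rules.append(rule)
    return rules


# Unit alignment from Table I and terminal categories from Fig. 1.
alignment = [("Si", None), ("Marlon", "Marlon"),
             ("ay", "is"), ("naglalaro", "playing")]
fil_tags = {"Si": "nom", "Marlon": "proper", "ay": "ay", "naglalaro": "verb"}
eng_tags = {"Marlon": "proper_noun", "is": "aux", "playing": "verb"}

rules = seed_rules(alignment, fil_tags, eng_tags)
for fil_cat, eng_cat in rules:
    print("Fil side=", fil_cat, "| Eng side=", eng_cat or "null")
```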
A compositional rule is the composition of seed rules or
other compositional rules into a larger compositional
element. This is similar to a production rule in
context-free grammars. It contains a head element on the
left side and the production on the right side. It should
be noted that the seed rules need not all be derived before
the compositional rules. The maximum compositional rule
in a transfer rule set is the compositional rule with
ROOT as the head element on both sides. Figure 3 shows
the derived compositional rules.
Fil side= HEAD : proper
Eng side= HEAD : proper_noun
Fil side= SUBJ : nom HEAD
Eng side= SUBJ : HEAD
Fil side= ROOT : ay SUBJ verb
Eng side= ROOT : aux SUBJ verb
Fig. 3: Compositional rules
In deriving the compositional rules, maximum
compositionality is maintained since all productions are
provided by the parse tree/c-structure representation.
This means that the top-most rule or the maximum
compositional rule would be Sentence <-> Sentence or in
this case ROOT: ay SUBJ verb <-> ROOT: aux SUBJ
verb.
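The derivation of compositional rules from the two c-structures can be sketched as follows. This is a simplified illustration under two assumptions that are not stated in the paper: each head label occurs at most once per tree, and Filipino and English productions are paired simply by sharing the same head label. The tree encoding and the function names are hypothetical.

```python
def productions(tree):
    """Return {head: [child labels]} for every non-terminal node.

    A tree is (label, children); a terminal is (category, word),
    i.e. its second element is a string rather than a list.
    """
    label, children = tree
    if isinstance(children, str):  # terminal node: no production
        return {}
    result = {label: [child[0] for child in children]}
    for child in children:
        result.update(productions(child))
    return result


def compositional_rules(fil_tree, eng_tree):
    """Pair productions of both trees that share a head label."""
    fil, eng = productions(fil_tree), productions(eng_tree)
    return {head: (fil[head], eng[head]) for head in fil if head in eng}


# C-structures from Fig. 1.
FIL = ("ROOT", [("ay", "ay"),
                ("SUBJ", [("nom", "si"),
                          ("HEAD", [("proper", "Marlon")])]),
                ("verb", "naglalaro")])
ENG = ("ROOT", [("aux", "is"),
                ("SUBJ", [("HEAD", [("proper_noun", "Marlon")])]),
                ("verb", "playing")])

rules = compositional_rules(FIL, ENG)
for head, (fil_side, eng_side) in rules.items():
    print(head, ":", " ".join(fil_side), "<->", head, ":", " ".join(eng_side))
```

The entry for ROOT corresponds to the maximum compositional rule of the rule set.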
In order to address semantics, unification constraints
are used. Unification constraints are useful for handling
feature passing from the source to the target language. An
important feature is tense (e.g. past, present, future).
Constraints also handle determiner dropping and
subject-verb agreement in English-Filipino and
Filipino-English translation.
Constraints are introduced at the transfer rules, both
in seed and compositional rules. The constraint indicates
which feature should be maintained. An example of a
unification constraint is shown in Figure 4.
Fil side= ay
Eng side= aux
(F1 NUM: any = E1 NUM: sing)
Fig. 4: Unification constraint
This unification constraint indicates that the English
auxiliary verb has a NUM feature with the value sing,
which means it is singular in number. The Filipino ay,
on the other hand, is neutral with respect to number.
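A constraint like the one in Fig. 4 can be checked with a small feature-compatibility test. The sketch below is not a full unification algorithm and is not taken from the system itself. It assumes that a constraint is a triple (feature, Filipino value, English value), that `any` matches every value, and that a missing feature is treated as `any`. All names are hypothetical.

```python
ANY = "any"  # wildcard value, as in "F1 NUM: any"


def unifies(required, actual):
    """Two atomic feature values unify if equal or if either is ANY."""
    return required == ANY or actual == ANY or required == actual


def satisfies(constraint, fil_features, eng_features):
    """Check one constraint against the features of both sides.

    Assumption: a feature absent from a side is treated as ANY.
    """
    feature, fil_value, eng_value = constraint
    return (unifies(fil_value, fil_features.get(feature, ANY))
            and unifies(eng_value, eng_features.get(feature, ANY)))


# The constraint of Fig. 4: (F1 NUM: any = E1 NUM: sing)
constraint = ("NUM", ANY, "sing")

ok_singular = satisfies(constraint, {}, {"NUM": "sing"})    # "ay" <-> "is"
ok_plural = satisfies(constraint, {}, {"NUM": "plural"})    # "ay" <-> "are"
print(ok_singular, ok_plural)
```

Under these assumptions, the rule pairing ay with a singular auxiliary applies to "is" but is rejected for a plural auxiliary.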
3.2. TRANSLATION MODULE
During translation, the hybrid system accepts an input
or source language sentence of either English or Filipino.
The source language is then parsed and a constituent
structure/parse tree is generated. If the sentence was
parsed successfully, meaning that it is accepted by the
grammar, the rule-based engine is used for translation.
However, if the source language sentence cannot be
parsed, the example-based engine is used. If the source
language sentence is parsed successfully but there are not
enough transfer rules for it, the example-based
engine is also used.
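The engine selection described above amounts to a two-step check. A minimal sketch, assuming a parser that returns `None` on failure and a caller-supplied predicate `rules_cover` that reports whether the transfer rule database covers the parse (both hypothetical interfaces):

```python
def choose_engine(parse_tree, rules_cover):
    """Pick the translation engine for one input sentence.

    parse_tree  -- the c-structure, or None if parsing failed
    rules_cover -- function(parse_tree) -> bool, True when the
                   transfer rules are sufficient for this parse
    """
    if parse_tree is None:           # sentence rejected by the grammar
        return "example-based"
    if not rules_cover(parse_tree):  # parsed, but rules are insufficient
        return "example-based"
    return "rule-based"


tree = ("ROOT", [])  # placeholder parse tree for the example
print(choose_engine(tree, lambda t: True))
print(choose_engine(tree, lambda t: False))
print(choose_engine(None, lambda t: True))
```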
The candidate transfer rule sets are filtered from the
entire transfer rule database by the maximum
compositional rule (the rule which contains the ROOT
structure). The topmost production of the parse tree is
compared against the maximum compositional rules of all
the transfer rule sets. For example, if the topmost
production of the parsed source language sentence is
ROOT : ay SUBJ verb, then all candidates will have the same
production on one side of their maximum compositional
rules.
All productions from the source language parse tree
should be present in the transfer rule set in order for the
transfer rule set to be considered an appropriate rule. If
no transfer rule set satisfies this condition, other transfer
rule sets will be matched for the missing productions. The
rule set with the highest match of productions plus the
additional rules from other rule sets will be used for
translation. However, if there are productions which are
not covered by the entire transfer rule database, then
there is no appropriate transfer rule set for the input
sentence and the example-based engine is used.
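The selection procedure described above can be sketched as follows. The data layout is an assumption made for illustration: each transfer rule set is a dictionary mapping a source-side production, written as (head, tuple of children), to its target-side production, and the function `select_rules` returns `None` when the example-based engine should take over.

```python
def select_rules(source_productions, rule_sets):
    """Choose transfer rules covering every source production.

    Returns a dict of rules, or None if some production is not
    covered anywhere in the transfer rule database.
    """
    # 1. Filter candidates by the maximum compositional rule (ROOT).
    root = next(p for p in source_productions if p[0] == "ROOT")
    candidates = [rs for rs in rule_sets if root in rs]
    if not candidates:
        return None
    # 2. Take the candidate that matches the most productions.
    best = max(candidates,
               key=lambda rs: sum(1 for p in source_productions if p in rs))
    chosen = {p: best[p] for p in source_productions if p in best}
    # 3. Fill any missing productions from other rule sets.
    for p in source_productions:
        if p in chosen:
            continue
        donor = next((rs for rs in rule_sets if p in rs), None)
        if donor is None:  # not covered by the whole database
            return None
        chosen[p] = donor[p]
    return chosen


# Rule set derived from the example in Fig. 3 (Filipino as source).
RULE_SET = {
    ("ROOT", ("ay", "SUBJ", "verb")): ("ROOT", ("aux", "SUBJ", "verb")),
    ("SUBJ", ("nom", "HEAD")): ("SUBJ", ("HEAD",)),
    ("HEAD", ("proper",)): ("HEAD", ("proper_noun",)),
}

covered = select_rules(set(RULE_SET), [RULE_SET])
uncovered = select_rules({("ROOT", ("ay", "SUBJ", "verb")),
                          ("HEAD", ("common",))}, [RULE_SET])
print(covered is not None, uncovered)
```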
In using the example-based engine, the source
language sentence is matched with the most similar
template. Then translation proceeds with using the most
similar template, wherein the variables in the template
are translated accordingly.
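Template-based translation can be illustrated with a deliberately simplified sketch. It differs from the description above in one respect that should be kept in mind: it requires the constants of a template to match exactly rather than choosing the most similar template by a similarity score. The template notation (variables written as `<1>`, `<2>`), the lexicon, and the function names are all hypothetical.

```python
def match_template(tokens, template):
    """Return variable bindings if tokens fit the template, else None."""
    if len(tokens) != len(template):
        return None
    bindings = {}
    for token, slot in zip(tokens, template):
        if slot.startswith("<"):          # variable slot
            bindings[slot] = token
        elif slot.lower() != token.lower():  # constant must match
            return None
    return bindings


def translate_with_template(tokens, source_template, target_template, lexicon):
    """Fill the target template, translating each bound variable."""
    bindings = match_template(tokens, source_template)
    if bindings is None:
        return None
    output = []
    for slot in target_template:
        if slot.startswith("<"):
            word = bindings[slot]
            # Fall back to the source word if it is not in the lexicon.
            output.append(lexicon.get(word.lower(), word))
        else:
            output.append(slot)
    return output


source_template = ["Si", "<1>", "ay", "<2>"]
target_template = ["<1>", "is", "<2>"]
lexicon = {"marlon": "Marlon", "naglalaro": "playing"}

result = translate_with_template(["Si", "Marlon", "ay", "naglalaro"],
                                 source_template, target_template, lexicon)
print(" ".join(result))
```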
4. DISCUSSION
The hybrid system is currently in its development
stage. Most of the components were adapted from LEFT
and Textual Translation, since both systems had completed
components which were already available. Although the
system is yet to be completed, it already shows promising
results, in that it is able to capture linguistic phenomena
that are not handled by LEFT. These include the
subject-verb agreement constraint and determiner dropping,
which LEFT was unable to address.
The use of two engines is for robustness. The rule-based
engine is intended to be used as much as possible, whereas
the example-based engine is present as its backup. The
example-based engine is also intended to handle
idiomatic expressions, which are generally not
grammatically correct. Its frequency of use, however,
remains to be tested.
[1] J. Hutchins, Machine Translation and Computer-Based
Translation Tools: What’s Available and How It’s Used,
A New Spectrum of Translation Studies pp. 13-48
(University of Valladolid, Valladolid, 2004).
[2] R. Roxas, Towards a hybrid machine translation
system for Philippine languages, Henry Sy
Professorial chair lecture (De La Salle University
Manila, 2003).
[3] D. Alcantara, B. Hong, A. Perez and L. Tan, Rule
Extraction Applied in Language Translation –
R.E.A.L. Translation, Undergraduate Thesis (De La
Salle University Manila, 2006).
[4] E. Chan, C. Lim, R. Tan and M. Tong, LEFT: Lexical
Functional Grammar Based English Filipino
Translator, Undergraduate Thesis (De La Salle
University Manila, 2006).
[5] K. Go, M. Morga, V. Nunez and F. Veto, TExt
Translation: Template Extraction for a Bidirectional
English-Filipino Example-Based Machine
Translation, Undergraduate Thesis (De La Salle
University Manila, 2006).