
Design and Implementation of Controlled Elicitation for Machine Translation
of Low-density Languages
Katharina Probst, Alon Lavie, Lori Levin, Erik Peterson, Jaime Carbonell
Language Technologies Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
U.S.A.
kathrin@cs.cmu.edu
Abstract
NICE (Native language Interpretation and Communication Environment) is a machine translation project for low-density
languages. For these languages very little online data is available, so we are forced to design new mechanisms to rapidly build
machine translation systems. In addition to using current rapid deployment mechanisms (EBMT, later SMT), we are building a
tool that will elicit a controlled corpus and automatically infer transfer rules. The focus of this paper is the design of the
elicitation corpus and tool. The elicitation corpus is intended to cover major typological phenomena, as it is designed to work for
any language. The tool is a means of eliciting the data from a bilingual speaker, who is not an expert in linguistics. The
informant will simply be asked to translate a number of sentences into the native language and specify the alignment of the
words. As most linguistic features (e.g. plural) are not employed in all languages, we strive to minimize the number of sentences
that each informant has to translate. This is achieved by means of Implicational Universals. From the elicited sentences, we
learn transfer rules with a Version Space learning algorithm.
1. Introduction and Motivation
With some exceptions (e.g. [6] and [7]), machine translation research and development in the past has focused
on so-called “major” languages, such as English, Spanish, German, and French. The reasons are understandable: it
is much easier to obtain a bilingual corpus for these languages than for languages spoken by small native
communities. Furthermore, funding agencies have tended to emphasize these languages simply because they were
considered more practical since more people would be able to use them. However, a machine translation system for
a low-density language can be extremely useful for its native speakers. They often have difficulties communicating
with the mainstream culture in their country. But communication is crucial in the case of medical problems, official
negotiations, crop failure, etc. Providing them with a tool that would facilitate communication can be extremely
helpful for the cultural preservation of these communities. If the natives can communicate in their language, they
will be more likely to teach it to their children. MT additionally enables governments to design educational
programs in the native languages. Furthermore, the preservation of endangered languages is an important goal from
a linguistic point of view: when a language dies, many unique linguistic features die out with it and can never be studied.
A machine translation system for a low-density language can thus serve as a tool for preservation as well as for
communication between the native people and their governments.
2. Other Approaches
Somers [14] acknowledges the lack of and need for machine translation systems for low-density languages.
However, the methods he proposes rely on a large bilingual corpus. For native languages, such a corpus is not
always available and has to be collected. This constitutes a problem if a system has to be developed quickly.
Nirenburg and his fellow researchers have addressed this issue in the project Boas [11-13]. Here, the bilingual
users are asked to translate sentences, but are also questioned about whether and how linguistic parameters are realized in
their native language. The idea is that a number of parameters exist in human language, and each specific language
chooses among them and realizes them in its own way. Another approach to quickly learning rules for a machine
translation system for low-density languages is described by Jones and Havrilla [9]. In their approach, called the Sentence
Tableau, the information is again elicited from the user, who marks the sentence in their native language with
features such as number, gender, etc. The sentence pairs are then automatically turned into machine translation
rules.
3. NICE – the global picture
Our approach to the problem is NICE (Native language Interpretation and Communication Environment), a
machine translation project for low-density languages. The design is centered around the problem that for these
languages very little online data is available, and the financial resources for machine translation development are
limited. Hence we are forced to design new mechanisms to rapidly build machine translation systems. Our
approach is to combine known techniques (EBMT, later statistical MT, for more information see [1-3]) with a new
approach for automatic inference of transfer rules.
For the EBMT module, we are using the EBMT engine that is in use in the Pangloss system [3]¹. In
addition, we are developing statistical techniques for robust MT with sparse data using exponential models and joint
source-channel modeling. The focus of this paper is the third module, the elicitation tool that will automatically
learn transfer rules from a controlled corpus. A bilingual user who is not an expert in linguistics is asked to translate
a number of sentences and specify the word alignment between source and target language sentences. This is
achieved by using minimal pairs of sentences, i.e. sentences that differ only in one linguistic feature. The goal of the
learning process is to match every translation example in the elicited bilingual corpus with a transfer rule that
accounts for the translation and is of an appropriate level of abstraction. Generalization to the level of basic
categories (parts of speech) will serve as the fundamental step in the rule abstraction learning process. For instance,
after the system observes the translations of several example phrases of similar structure but different nouns and
adjectives, it will automatically infer that the transfer rules of these examples can be collapsed into a single transfer
rule. Our new method is based on the Version Space (VS) hypothesis formation technique [10]. It assumes a
hypothesis space with a partial order relation between the hypotheses. Using both positive and negative learning
instances, the new locally-constrained VS algorithm efficiently narrows down the Version Space – the space of
hypotheses that are consistent with all of the instances seen so far – until the hypothesis with the correct level of
abstraction is identified.
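To illustrate the intended generalization step, consider the following simplified sketch. It is not the locally-constrained VS algorithm itself; the rule representation and function names are our own. It merely shows how two example-specific rules of identical structure can be collapsed into a single, more abstract rule over part-of-speech categories, which is the kind of abstraction described above.

def generalize(rule_a, rule_b):
    # Each rule is a list of (word_or_category, part_of_speech) tokens.
    # Two structurally identical rules are collapsed by replacing words
    # that differ across the examples with their shared category.
    if len(rule_a) != len(rule_b):
        return None  # structurally different examples are not collapsed here
    collapsed = []
    for (word_a, pos_a), (word_b, pos_b) in zip(rule_a, rule_b):
        if pos_a != pos_b:
            return None  # different categories: keep the rules separate
        collapsed.append((word_a, pos_a) if word_a == word_b else (pos_a, pos_a))
    return collapsed

# Two observed noun phrases with the same structure but different lexical items:
example_1 = [("the", "DET"), ("big", "ADJ"), ("tree", "N")]
example_2 = [("the", "DET"), ("small", "ADJ"), ("house", "N")]
print(generalize(example_1, example_2))
# [('the', 'DET'), ('ADJ', 'ADJ'), ('N', 'N')] -- one more abstract rule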
The elicitation tool and the EBMT and SMT modules will be combined in a multi-engine system, which
will utilize the strengths and overcome the weaknesses of each singular approach. Upon completion, the tool will
provide a means to build an initial MT system within a matter of human weeks. This system can then be improved
over months as more bilingual data becomes available.
4. Elicitation Corpus
The elicitation corpus for the NICE project contains lists of sentences in the source language (English,
Spanish). In the elicitation process, the user will translate a subset of these sentences, enough to learn the desired
grammar rules. The design of the elicitation corpus requires that it be compositional, so that smaller phrases form
the components of larger ones. In this way, smaller components such as phrases can be used as building blocks for
transfer rules of larger components such as clauses. This is a requirement of the Version Space learning algorithm
that we intend to use for learning transfer rules. The elicitation corpus is also intended to cover major typological
features. Like Boas, NICE emulates the work of a field linguist [11]. Hence, it is intuitively reasonable to use
similar tools. We base our elicitation corpus on typological checklists such as [5].
As initially no bilingual dictionary is available, we start the elicitation process by asking the user for
translations of basic terms². These terms are claimed to exist even in remote languages (e.g. tree, cloud, father, etc.).
This is important so that the user can actually translate the sentences and we can thus distinguish between features
that do not exist in a language and vocabulary that does not exist. In anticipation of noun classes, we also include
nouns that may have inherent gender (man, woman, etc.), natural and man-made objects, abstract concepts, etc.
Next, more complex structures are elicited: noun phrases with or without modifiers, transitive
and intransitive sentences, etc. The pilot corpus includes about 800 sentences. It is beyond the scope of this paper
to give an exhaustive list of the elicited features. Suffice it to mention that they include, but are not limited to, the
position of the adjective with respect to the head noun, the word order in answers, agreement of the verb with
definite versus indefinite object, questions, relative clauses and possessors.
5. Elicitation Tool
As mentioned above, we are developing a tool for rapid deployment of a transfer-based system, which we
call GLAD (for Generalized Linguistic Acquisition Device). In the elicitation tool learning occurs for two purposes:
as mentioned, we intend to use a seeded version space algorithm to automatically infer transfer rules between the
languages. The other type of learning occurs in order to guide the user through the elicitation process and minimize
the number of sentences that need to be translated. At the moment, our efforts focus on the latter type.
In order to minimize the required translations, the elicitation corpus first has to be ordered. Research on
Implicational Universals found that certain features are guaranteed not to exist in a language if certain other features
are not present. For instance, if a language does not mark plural, then it will also not mark dual or paucal (on
Implicational Universals see e.g. [4] and [8]). The tool will thus first inquire about the existence of a plural, and only ask
about dual and paucal if plural was found to exist. In other cases, tests cannot be performed properly without
knowing the results of other tests. For example, the form of adjectives often depends on the noun class of their head
noun. If we have not yet performed tests to determine the noun classes in a language, it would be difficult to learn
the morphology of adjectives. Our goal is thus to design a hierarchy of linguistic features. We have developed a
system design that allows for such a hierarchy. The system is organized as a tree: its logical flow can be thought of
as a depth-first search of this tree, where each state contains a list of tests. Figure 1 below illustrates the basic design of the system.

[Figure 1 omitted: a tree of states with Diagnostic Tests at the center, branching into feature states such as Plural (which in turn leads to Dual and Paucal) and S-V agreement, among others.]
Figure 1: Logical flow of data elicitation

¹ We are currently collecting data in Mapudungun, a language spoken in southern Chile, and Iñupiaq, spoken in Alaska.
² This follows the tradition of the Swadesh List (named after the linguist Morris Swadesh).
The central dispatching unit is the state labeled as Diagnostic Tests. It contains tests that serve merely to
provide an initial idea of whether a linguistic feature exists in a language or not. For instance, if the feature plural is
detected in the diagnostic tests, control is passed to the state Plural, which will then perform tests that are related to
plural, including dependent features like dual and paucal. The system returns to the node Diagnostic Tests only after
the list of plural tests is exhausted.
The tests in each node have to be sequenced according to the hierarchy described above. For instance, the
state Diagnostic Tests first performs tests regarding noun classes before it starts inquiring about adjectival forms. In order
for previously elicited information to be useful to later tests, it is necessary that the states share information with each
other. This is achieved by having a central Results unit. The results of all tests are written to this unit. Before
prompting the user for the translation of a minimal pair of sentences, this unit is then consulted to see whether the
minimal pair still presents an open question. For instance, if it is found that a language has no dual, a minimal pair
such as the following would become obsolete if the system is testing for possessor forms:
My two cats are brown.
My three cats are brown.
The above sentences would not be expected to result in a difference in the possessor, as we have already
established that the language does not differentiate between dual and plural. This again reduces the number of
sentences that the user is asked to translate.
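The following sketch illustrates how such pruning might be implemented. The feature names, prerequisite table, and Results unit shown here are simplified stand-ins of our own, not the actual GLAD data structures.

results = {}  # central Results unit: feature name -> True/False

# A feature remains an open question only if all its prerequisites exist.
PREREQUISITES = {
    "plural": [],
    "dual": ["plural"],
    "paucal": ["plural"],
}

def is_open_question(feature):
    # Consult the Results unit before prompting the user for a minimal pair.
    if feature in results:
        return False  # already decided, no need to elicit further sentences
    return all(results.get(p, False) for p in PREREQUISITES[feature])

def record(feature, detected):
    results[feature] = detected

record("plural", False)           # diagnostic tests found no plural marking
print(is_open_question("dual"))   # False: all dual minimal pairs are pruned
print(is_open_question("paucal")) # False: all paucal minimal pairs are pruned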
6. GLAD Prototype
We have implemented a prototype of the elicitation tool. It consists of two major components. The first
component is the interface. Behind the interface runs the GLAD system. While this pilot version makes
simplifying assumptions³, it provides us with a testbed for future investigation. A picture of the interface can be
seen in Figure 2.

Figure 2: Elicitation interface
The users are presented with a source language sentence. They then translate the sentence into their native
language. In the next step, the users specify how the words in the two sentences (source and target) align. This is
achieved by clicking on a word in one sentence (either source or target) and then clicking on the corresponding word
in the translation. The tool allows for the specification of one-to-one, one-to-many, many-to-one, and many-to-many alignments. It also allows for words to not have a correspondence in the other sentence at all. This flexibility
is necessary as all of the above alignments are possible between two languages. Consider the famous example of
noun-noun compounding in German, where nouns can be clustered to form a single word, for instance “Autotür”
(“car door”). In this example, two English words line up to one German word.
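All of these alignment types can be captured as a set of index pairs. The small sketch below uses the “car door” / “Autotür” example; the representation is our own illustration, although it mirrors the parenthetical notation shown in the appendix.

source = ["the", "car", "door"]
target = ["die", "Autotür"]

# One pair per click on corresponding words (1-based indices). "car" and
# "door" both align to "Autotür", giving a many-to-one alignment; a word
# without any pair simply has no correspondence in the other sentence.
alignment = {(1, 1), (2, 2), (3, 2)}

def target_words_for(source_index):
    # Return every target word aligned to the given source word.
    return [target[t - 1] for s, t in sorted(alignment) if s == source_index]

print(target_words_for(2))  # ['Autotür']
print(target_words_for(3))  # ['Autotür']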
It should be noted in passing that while the interface is usually run in conjunction with GLAD, it can also be run
in batch mode. In this mode, it provides a useful tool for data collection. We are planning to exploit the alignment
given by the user to facilitate building a bilingual dictionary for EBMT.
When the interface is run in conjunction with GLAD, the user is never bothered with the processes of comparison,
learning, sequencing, etc., that are run in the background. All they see is a sequence of source language sentences
being displayed for translation.
We have several tests implemented for GLAD. One example of a currently implemented diagnostic test is plural.
First, the system provides the user with two minimal pairs of sentences. In each minimal pair, the two sentences
differ only in the number feature. For instance, consider the following sentences:
The egg fell.
The eggs fell.
They differ only in one linguistic feature, namely the number of the subject.
³ For example, in this first version we assume that the feature that is tested for is expressed as a morpheme rather than in a change
of word order. These remaining issues will be addressed in future versions of the system.
The GLAD system will (through the interface) obtain the translation of the above sentences. It will then
determine what the target-language equivalent of “egg”, “eggs”, “tree”, and “trees” is and perform pairwise
comparisons. If the system detects that the target language differentiates between “egg” and “eggs” and/or “tree”
and “trees”, it makes the logical flow decision to inquire further on the matter of plurals⁴. So it moves to the
appropriate state Plural, where it starts a sequence of tests regarding dual, paucal, and so forth. When this sequence
is completed, control is returned to the state Diagnostic Tests, which continues down its scheduled list of tests.
Sample output of our system can be seen in the appendix.
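The decision logic of this diagnostic test can be sketched as follows. This is a simplification of GLAD's behavior, and the function name and data layout are ours; the word forms are taken from the Swahili output in the appendix.

def plural_detected(minimal_pairs):
    # minimal_pairs: list of (singular_target_form, plural_target_form).
    # The feature is provisionally detected if at least one pair differs.
    return any(singular != plural for singular, plural in minimal_pairs)

pairs = [("Yai", "Mayai"),  # "The egg fell" / "The eggs fell"
         ("Mti", "Miti")]   # "The tree protected ..." / "The trees protected ..."

if plural_detected(pairs):
    print("Feature Plural detected: move to state Plural (dual, paucal, ...)")
else:
    print("Feature Plural not detected: continue with Diagnostic Tests")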
A test is specified by a list of its minimal pairs of sentences together with the comparison criterion for each
minimal pair. In the example of plural, the comparison criterion would be the index in the source language sentence
that points to the noun whose form we examine in singular and plural. For example, consider the minimal pair
Two cats ran across the street.
Four cats ran across the street.
Here, the comparison criterion is 2, the (one-based counting) index of the word in question. This index is used to
determine the index of the target language word in question, which then leads to the target word.
As can be seen in the GLAD output, the word-by-word alignments are stored in a parenthetical notation. When
the user clicks on two corresponding words, this alignment is stored internally as a pair consisting of the indices of
the words in their respective sentences. This is done not only to compress the representation to a minimum; storing
the alignments as indices also ensures that no ambiguities are caused when a word occurs more than once in a
sentence.
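To make the index mechanics concrete, the sketch below resolves a comparison criterion to the target word it points to, using the alignment pairs from the appendix. Treating a target index of 0 as “no correspondence” is our reading of that sample output rather than a documented convention, and the function name is hypothetical.

def target_word(criterion, alignment, target_sentence):
    # criterion: 1-based index of the source word under examination.
    # alignment: list of (source_index, target_index) pairs as stored by GLAD.
    words = target_sentence.split()
    for source_index, target_index in alignment:
        if source_index == criterion and target_index != 0:
            return words[target_index - 1]  # indices are 1-based
    return None  # the source word has no correspondence in the target

alignment = [(1, 2), (2, 1), (3, 3), (4, 3), (5, 0), (6, 4)]
print(target_word(2, alignment, "Paka wawili walivuka barabara"))
# 'Paka' -- the translation of "cats" in "Two cats ran across the street"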
The sentences and tests are stored separately, so that in the test we only store the index of each sentence. The
actual sentence is then retrieved from the database by this index. This ensures that sentences can be re-used for a
number of different tests. For example, the sentence “The tree protected the animal” (see the appendix) is used in the test for
number, but also in a further test for definiteness. Furthermore, it is useful to store the translation and alignments of
each sentence. This way, the user will only have to translate each sentence once, regardless of how many tests make
use of this sentence. This is an important feature to avoid user frustration.
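A minimal sketch of this storage scheme is given below; the class and method names are hypothetical, but the idea of indexing sentences once and caching their translations follows the description above.

class SentenceStore:
    def __init__(self, sentences):
        self.sentences = dict(enumerate(sentences))  # index -> source sentence
        self.cache = {}                              # index -> (target, alignment)

    def elicit(self, index, ask_user):
        # Return the cached translation, or prompt the user exactly once.
        if index not in self.cache:
            self.cache[index] = ask_user(self.sentences[index])
        return self.cache[index]

store = SentenceStore(["The tree protected the animal",
                       "The trees protected the animal"])

# Both the number test and the definiteness test reuse sentence 0; the
# (simulated) user is prompted only the first time.
fake_user = lambda s: ("Mti ulimkinga mnyama",
                       [(1, 0), (2, 1), (3, 2), (4, 0), (5, 3)])
print(store.elicit(0, fake_user))
print(store.elicit(0, fake_user))  # served from the cache, no second prompt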
7. Future work
To reach our ultimate goal of building a rapid-deployment tool that automatically learns transfer rules, we have to
expand and improve our prototype. The corpus will have to be expanded to cover more typological features, and
tests for these features will have to be integrated in the tool hierarchy.
Furthermore, we plan to implement mechanisms by which the sentence corpus can be tagged with features. Some
of these features are pre-determined, as for example definiteness. Other features will have to be determined by
performing tests, as for example noun classes. The tags will play an important role in the selection of tests. In the
case of the adjective tests, the tags will indicate which tests fall in one equivalence class (all the ones whose head
nouns are in the same noun class). We then need not perform all available tests, but only one from each equivalence class.
We are additionally planning to use our system in the near future to construct transfer rules using version space
learning as was described above. For this algorithm to work, we will make heavy use of the features each sentence
is tagged with. For instance, to be able to produce a transfer rule that applies only to adjectives of definite nouns, we
need to know the part-of-speech tags of a sentence in addition to information about definiteness.
8. Conclusion
We believe that this work is significant to the future of machine translation for several reasons. In recent years,
the research community has recognized the importance of tools that can build machine translation systems quickly.
Especially in the case of low-density languages, the communities as well as the governments have limited financial
resources that can be spent on the development of a machine translation system. We believe that NICE can provide a
tool that will make such high development costs unnecessary. In addition, NICE does not require the existence of a large
bilingual corpus, while its performance can be improved if such a corpus is available or can be collected over
time. It will be able to build an MT engine using very little data compared to other systems, especially rapid-deployment systems. Again, this is a crucial element when working with low-density languages.
It is important to note here that our paradigm of conducting MT research is a novelty in the field. We have
formed partnerships with native communities, and not only are they interested in our project, they are collecting and
archiving the data themselves. To summarize, our vision for the next 10 years is that MT can be learned quickly
from native speakers without linguistic expertise, which will allow MT to branch out to lower-density languages and
thus to more language families.
⁴ Note that these are only prototypical tests. In the future, we will not rely simply on the form of the head noun in singular and
plural, and will perform more than two diagnostic tests before ruling out a feature such as plural.
Bibliography
[1] Al-Onaizan, Y., Curin, J., Jahr, M., Knight, K., Lafferty, J., Melamed, D., Och, F.-J., Purdy, D., Smith, N. A.,
and Yarowsky, D. 1999. Statistical Machine Translation, Final Report, JHU Workshop 1999. Technical Report, CLSP/JHU.
[2] Brown, Peter F.; et al. A Statistical Approach to Machine Translation. In: Computational Linguistics, 16(2):79-85, 1990.
[3] Brown, Ralf. Example-Based Machine Translation in the Pangloss System. In: Proceedings of the 16th
International Conference on Computational Linguistics (COLING-96), 1996.
[4] Comrie, Bernard. Language Universals & Linguistic Typology. 2nd ed. The University of Chicago Press, 1989 [1st ed. 1981].
[5] Comrie, Bernard; N. Smith. Lingua Descriptive Series: Questionnaire. In: Lingua, 42:1-72, 1977.
[6] Frederking, Robert; A. Rudnicky; C. Hogan; K. Lenzo. Interactive Speech Translation in the DIPLOMAT
Project. In: Machine Translation Journal, Special issue on Spoken Language Translation. To appear.
[7] Frederking, Robert; A. Rudnicky; C. Hogan. Interactive Speech Translation in the DIPLOMAT Project. In:
Steven Krauwer et al. (eds.). Spoken Language Translation: Proceedings of a Workshop. Association for
Computational Linguistics and European Network in Language and Speech, 1997.
[8] Greenberg, Joseph H. 1966. Universals of Language. 2nd ed. Cambridge, MA: MIT Press.
[9] Jones, Douglas; R. Havrilla. Twisted Pair Grammar: Support for Rapid Development of Machine Translation for
Low Density Languages. In: Proceedings of AMTA, 1998.
[10] Mitchell, Tom. Generalization as Search. In: Artificial Intelligence, 18:203-226, 1982.
[11] Nirenburg, Sergei. Project Boas: “A Linguist in the Box” as a Multi-Purpose Language. In: Proceedings of
LREC, 1998.
[12] Nirenburg, Sergei; V. Raskin. Universal Grammar and Lexis for Quick Ramp-Up of MT Systems. In:
Proceedings of COLING-ACL, 1998.
[13] Sheremetyeva, Svetlana; S. Nirenburg. Towards a Universal Tool For NLP Resource Acquisition. In:
Proceedings of the Second International Conference on Language Resources and Evaluation (Athens,
Greece, May 31 - June 3, 2000).
[14] Somers, Harold L. Machine Translation and Minority Languages. In: Translating and the Computer 19: Papers
from the Aslib conference (London, November 1997).
Appendix: Sample output of the elicitation tool. The target language here is Swahili.

Category:
DiagTests
Test:
Plural
First sentence of minimal pair:
Source:
The egg fell
Target:
Yai lilianguka
Alignment: ((1,0),(2,1),(3,2))
Target word1:
=>Yai
Second sentence of minimal pair:
Source:
The eggs fell
Target:
Mayai yalianguka
Alignment: ((1,0),(2,1),(3,2))
Target word2:
=>Mayai
=>The two words are not equal.
First sentence of minimal pair:
Source:
The tree protected the animal
Target:
Mti ulimkinga mnyama
Alignment: ((1,0),(2,1),(3,2),(4,0),(5,3))
Target word1:
=>Mti
Second sentence of minimal pair:
Source:
The trees protected the animal
Target:
Miti ilimkinga mnyama
Alignment: ((1,0),(2,1),(3,2),(4,0),(5,3))
Target word2:
=>Miti
=>The two words are not equal.
Feature Plural detected.
Category:
Plural
Test:
Dual
First sentence of minimal pair:
Source:
Two cats ran across the street
Target:
Paka wawili walivuka barabara
Alignment: ((1,2),(2,1),(3,3),(4,3),(5,0),(6,4))
Target word1:
=>Paka
Second sentence of minimal pair:
Source:
Four cats ran across the street
Target:
Paka wanne walikimbia na kuvuka barabara
Alignment: ((1,2),(2,1),(3,3),(4,4),(4,5),(5,0),(6,6))
Target word2:
=>Paka
=>The two words are equal.
First sentence of minimal pair:
Source:
Two men danced
Target:
wanaume wawili walicheza dansi
Alignment: ((1,2),(2,1),(3,3),(3,4))
Target word1:
=>wanaume
Second sentence of minimal pair:
Source:
Many men danced
Target:
wanaume wengi walicheza dansi
Alignment: ((1,2),(2,1),(3,3),(3,4))
Target word2:
=>wanaume
=>The two words are equal.
Feature Dual not detected.