Semi-Automated Elicitation Corpus Generation for Minority Languages

advertisement
Semi-Automated Corpus Generation for Minority Languages
Alison Alvarez
Jeff Good (MPI Leipzig)
nosila@cs.cmu.edu
good@eva.mpg.de
Max Planck Institute for Evolutionary Anthropology
Deutscher Platz 6
04103 Leipzig
Lori Levin
lsl@cs.cmu.edu
Robert Frederking
ref+@cs.cmu.edu
Erik Peterson
eepeter@cs.cmu.edu
Language Technologies Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15217
Abstract
In this document we will describe a semi-automated process for creating elicitation corpora. An
elicitation corpus is translated by a bilingual consultant in order to produce high quality word
aligned sentence pairs. The corpus sentences are automatically generated from detailed feature
structures using the GenKit generation program. Feature structures themselves are automatically
generated from information that is provided by a linguist using our corpus specification software.
This helps us to build small, flexible corpora for testing and development of machine translation
systems.
Keywords: corpora, elicitation, minor languages, generation
1 Introduction
In the field of Machine Translation
fully aligned and tagged translation corpora
are considered to be one of the most valuable resources for automatically training
translation systems. However, among minority languages such resources are hard to
find. It is possible to overcome this obstacle
by using techniques inspired by field linguistics. That is, by drawing on bilingual
informants to translate and align given sentences. We do this though a piece of
software called the elicitation tool (see Box 1
on the following page) that presents sentences and context clues to a bilingual
informant and collects translations and
alignments.
Our purpose is to use these resources to automatically discover the
typological features of a language. As a first
step we must design an elicitation corpus
composed of sentences that make typological discovery possible.
Taking into account the sheer diversity of typological systems in the languages
of the world, a representative typological
corpus would be too large to write by hand.
For example, systems for case marking can
vary wildly from language to language. In
many languages features like definiteness or
semantic role can influence case marking,
Furthermore, in Kartvelian languages 1 verb
aspect can also change case marking for the
subject. Thus, due to the unpredictability of
morphology and syntax we can’t always
count on prior linguistic knowledge to help
simplify the mapping of these typological
features. As a result, our search space is
greatly expanded. Our challenge is to gen-
languages do not all speak English. Latin
American minority language speakers would
be more likely to be bilingual with Spanish
rather than English. Thus, we would want a
version of the elicitation corpus available in
Spanish. We also want flexibility in lexical
selection to order to avoid cultural bias and
to choose appropriate lexical items for the
major language.
erate a typologically diverse corpus that will
accurately cover this typological space with
minimal human guidance. We will call this
type of corpus a functional-typological corpus.
Field linguists have relied on questionnaires that have remained relatively
static over a number of years. We want the
flexibility to change the questionnaire to reflect different semantic domains, different
goals for machine translation systems, different levels of detail, etc. We also want the
questionnaire to be available in multiple
languages. For example, speakers of minor
corpora. Specifically, we will be discussing
the use of feature structures both as a way of
covering a range of sentences and as a
framework for generation. Furthermore we
will address issues pertaining to generation
with open-ended lexical selections to add
flexibility among languages.
1
This family includes Georgian, Megrelian, Laz
and Svan.
This paper will look at methods for
specifying the scope and depth of an elicitation corpus as well as methods for quick
design and implementation of elicitation
2 Related Work
2 Elicitation Corpora
The functional-typological corpus
refers to an untranslated corpus in the major
Box 2: The elicitation tool is used by the consultant for translation and alignment.
Generation with language
specific GenKit Grammars
Feature
Structures
Elicitation from
informant
Maj. L. X
Min. Lang A
Min. Lang B
Maj. L. Y
Min. Lang C
Min. Lang D
Min. Lang E
Min. Lang F
Maj. L. Z
Major Languages
Minor Languages
Box 2: An illustration of the flexibility of the feature structures used for elicitation
language that is made up of a collection of
feature structures. These feature structures
can be then be used to generate a human
readable surface string. The makeup of the
feature structures is determined by two
things, a linguist-specified feature specification and a formalized description of desired
sentences.
2.1 Feature Structures
Feature structures are a multi-level
set of feature value pairs that are used to reflect the grammatical structures intended for
elicitation. The feature structures are designed in the style of Lexical-Functional
Grammar in that they express the grammatical relations of the major language (English
or Spanish) that is used for elicitation.
2.2 Feature Specification
The feature specification is originally
written in XML markup and then converted
to a machine-readable format. The purpose
of the feature specification is to define the
list of features and corresponding values that
are available for producing feature structures.
For example, the polarity feature carries the
value of positive and negative and can be
applied at the clause level. More complex
features express distinctions in time and aspect. The XML schema can express any
type of feature system. However, we have
focused on a functional-typological feature
set. The feature set is functional in the sense
that it describes functions like actor and undergoer rather than morpho-syntactic
realizations such as nominative and accusative. A process of feature detection (not
included in this demo) determines which
functions have morpho-syntactic realizations
in the minor language that is elicited.
((subj
((np-my-general-type pronoun)
(np-my-person person-unk)
(np-my-number num-sg)
(np-my-animacy anim-human)
(np-my-function fn-predicatee)…
(np-pronoun-exclusivity
exclusivity-n/a)))
(predicate ((adj-type adj-descriptive)))
(c-my-copula-type descriptive)
(c-my-secondary-type secondary-copula)
(c-my-polarity polarity-positive)
(c-my-general-type open-question)
(c-gap-function gap-copula-subject)…
(c-my-sp-act sp-act-request-information))
Box 3: A sample feature structure for the
sentence “Who is hungry?”
Additionally, the feature specification defines what feature-values cannot be
combined. For instance, exclusivity cannot
be applied to singular pronouns. Currently,
the feature specification contains about 50
features and a few hundred values.
2.3 Multiplies
The formalized description of the
desired set of feature structures is called a
"multiply". This description is written using
our GUI to select features and set their range
of values. Features can be specified with
just one set value, a list of values that will be
used alternated throughout the set of generated feature structures, or have the string
'#all' written next to it. This indicates that
all values are to be multiplied out into the
set of feature structures. Furthermore, disjoint statements can be used to create
structures where one or more features vary
in tandem. See Box 5 for an illustration of
the details.
3 Generation
Generation is performed using GenKit
generation software. It takes the feature
structures along with a corresponding
grammar and lexicon and generates a surface string along with a comment.
Generated comments are used to show pieces of meaning that might not be evident in
the not be evident in the major/source language but may be found in the
target/minority language. For example, the
first person singular pronoun in English does
not carry gender, so a comment will be generated indicating that “I = gender-female” or
“I = gender-male”. When using the elicitation tool this information is presented to the
bilingual informant using the context field.
Generation itself is done using the same
GUI used for feature structure generation.
An optional window is used to gather the
resulting sentences and save them in elicitation tool format.
4 References
Bouquiaux, Luc and J.M.C. Thomas. 1992.
Studying and Describing Unwritten Languages. Dallas, TX: The Summer Institute
of Linguistcs.
Comrie, Bernard and N. Smith. 1977. Lingua descriptive series: Questionnaire. In:
Lingua, 42:1-72.
Levin, L., R. Vega, J. Carbonell, R. Brown,
A. Lavie, E. Canulef and C. Huenchullan.
June 2002. “Data Collection and Language
Technologies for Mapudungun”. In Proceedings of International Workshop on
Resources and Tools in Field Linguistics at
the Third International Conference on Language Resources and Evaluation (LREC2002), Las Palmas, Canary Islands, Spain.
Probst, K., R. Brown, J. Carbonell, A. Lavie,
L. Levin, and E. Peterson. September 2001.
“Design and Implementation of Controlled
Elicitation for Machine Translation of Lowdensity Languages”. In Proceedings of the
MT-2010 Workshop at MT-Summit VIII,
Santiago de Compostela, Spain.
((subj ((np-my-general-type pronoun-type common-noun-type)
(np-my-person person-first person-second person-third)
“Multiply out by these lists of values”
(np-my-number num-sg num-pl)
(np-my-biological-gender bio-gender-male bio-gender-female)
(np-my-function fn-predicatee)))
{[(predicate ((np-my-general-type common-noun-type)
(np-my-definiteness definiteness-minus) (np-my-person person-third)
Disjoint set of copula
(np-my-function predicate))) (c-my-copula-type role)]
types and their predi[(predicate ((adj-my-general-type quality-type))) (c-my-copula-type attributive)]
cates
[(predicate ((np-my-general-type common-noun-type)
(np-my-person person-third) (np-my-definiteness definiteness-plus)
(np-my-function predicate))) (c-my-copula-type identity)]}
“Use all values of polarity”
(c-my-secondary-type secondary-copula) (c-my-polarity #all)
(c-my-function fn-main-clause)(c-my-general-type declarative)
(c-my-speech-act sp-act-state) (c-v-my-grammatical-aspect gram-aspect-neutral)
(c-v-my-lexical-aspect state) (c-v-my-absolute-tense past present future)
(c-v-my-phase-aspect durative))
Box 4: a multiply specification used to define a set of copula sentences with all combinations of
tense, and subject person, number and gender
Download