Semi-Automated Corpus Generation for Minority Languages

Alison Alvarez (nosila@cs.cmu.edu), Lori Levin (lsl@cs.cmu.edu), Robert Frederking (ref+@cs.cmu.edu), Erik Peterson (eepeter@cs.cmu.edu)
Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15217

Jeff Good (good@eva.mpg.de)
Max Planck Institute for Evolutionary Anthropology (MPI Leipzig), Deutscher Platz 6, 04103 Leipzig

Abstract

In this document we describe a semi-automated process for creating elicitation corpora. An elicitation corpus is translated by a bilingual consultant in order to produce high-quality, word-aligned sentence pairs. The corpus sentences are automatically generated from detailed feature structures using the GenKit generation program. The feature structures themselves are automatically generated from information that a linguist provides using our corpus specification software. This helps us to build small, flexible corpora for testing and developing machine translation systems.

Keywords: corpora, elicitation, minor languages, generation

1 Introduction

In the field of machine translation, fully aligned and tagged translation corpora are considered one of the most valuable resources for automatically training translation systems. However, such resources are hard to find for minority languages. It is possible to overcome this obstacle by using techniques inspired by field linguistics, that is, by drawing on bilingual informants to translate and align given sentences. We do this through a piece of software called the elicitation tool (see Box 1), which presents sentences and context clues to a bilingual informant and collects translations and alignments.

Our purpose is to use these resources to automatically discover the typological features of a language. As a first step we must design an elicitation corpus composed of sentences that make typological discovery possible.
Taking into account the sheer diversity of typological systems in the languages of the world, a representative typological corpus would be too large to write by hand. For example, systems for case marking can vary widely from language to language. In many languages, features like definiteness or semantic role can influence case marking. Furthermore, in Kartvelian languages¹ verb aspect can also change the case marking of the subject. Thus, because morphology and syntax are unpredictable, we cannot always count on prior linguistic knowledge to help simplify the mapping of these typological features. As a result, our search space is greatly expanded. Our challenge is to generate a typologically diverse corpus that will accurately cover this typological space with minimal human guidance. We will call this type of corpus a functional-typological corpus.

Field linguists have relied on questionnaires that have remained relatively static over a number of years. We want the flexibility to change the questionnaire to reflect different semantic domains, different goals for machine translation systems, different levels of detail, etc. We also want the questionnaire to be available in multiple languages. For example, speakers of minor languages do not all speak English. Latin American minority language speakers would be more likely to be bilingual in Spanish than in English. Thus, we would want a version of the elicitation corpus available in Spanish. We also want flexibility in lexical selection in order to avoid cultural bias and to choose appropriate lexical items for the major language.

This paper will look at methods for specifying the scope and depth of an elicitation corpus as well as methods for quick design and implementation of elicitation corpora. Specifically, we will discuss the use of feature structures both as a way of covering a range of sentences and as a framework for generation. Furthermore, we will address issues pertaining to generation with open-ended lexical selections to add flexibility among languages.

¹ This family includes Georgian, Megrelian, Laz and Svan.

2 Related Work

2 Elicitation Corpora

The functional-typological corpus refers to an untranslated corpus in the major language that is made up of a collection of feature structures. These feature structures can then be used to generate a human-readable surface string. The makeup of the feature structures is determined by two things: a linguist-specified feature specification and a formalized description of the desired sentences.

Box 1: The elicitation tool is used by the consultant for translation and alignment.

[Diagram labels: Feature Structures; Generation with language-specific GenKit grammars; Major Languages (Maj. L. X, Y, Z); Elicitation from informant; Minor Languages (Min. Lang A to F)]
Box 2: An illustration of the flexibility of the feature structures used for elicitation

2.1 Feature Structures

Feature structures are multi-level sets of feature-value pairs that reflect the grammatical structures intended for elicitation. The feature structures are designed in the style of Lexical-Functional Grammar in that they express the grammatical relations of the major language (English or Spanish) that is used for elicitation.

2.2 Feature Specification

The feature specification is originally written in XML markup and then converted to a machine-readable format. The purpose of the feature specification is to define the list of features and corresponding values that are available for producing feature structures. For example, the polarity feature takes the values positive and negative and can be applied at the clause level. More complex features express distinctions in time and aspect. The XML schema can express any type of feature system. However, we have focused on a functional-typological feature set.
The feature set is functional in the sense that it describes functions like actor and undergoer rather than morpho-syntactic realizations such as nominative and accusative. A process of feature detection (not included in this demo) determines which functions have morpho-syntactic realizations in the minor language that is elicited.

    ((subj ((np-my-general-type pronoun)
            (np-my-person person-unk)
            (np-my-number num-sg)
            (np-my-animacy anim-human)
            (np-my-function fn-predicatee)…
            (np-pronoun-exclusivity exclusivity-n/a)))
     (predicate ((adj-type adj-descriptive)))
     (c-my-copula-type descriptive)
     (c-my-secondary-type secondary-copula)
     (c-my-polarity polarity-positive)
     (c-my-general-type open-question)
     (c-gap-function gap-copula-subject)…
     (c-my-sp-act sp-act-request-information))

Box 3: A sample feature structure for the sentence “Who is hungry?”

Additionally, the feature specification defines which feature values cannot be combined. For instance, exclusivity cannot be applied to singular pronouns. Currently, the feature specification contains about 50 features and a few hundred values.

2.3 Multiplies

The formalized description of the desired set of feature structures is called a "multiply". This description is written using our GUI to select features and set their range of values. A feature can be specified with just one set value, with a list of values that will be alternated throughout the set of generated feature structures, or with the string '#all' written next to it. The latter indicates that all values of the feature are to be multiplied out into the set of feature structures. Furthermore, disjoint statements can be used to create structures where one or more features vary in tandem. See Box 4 for an illustration of the details.

3 Generation

Generation is performed using the GenKit generation software. It takes the feature structures along with a corresponding grammar and lexicon and generates a surface string along with a comment.
Generated comments are used to show pieces of meaning that might not be evident in the major/source language but may be found in the target/minority language. For example, the first person singular pronoun in English does not carry gender, so a comment will be generated indicating that “I = gender-female” or “I = gender-male”. When using the elicitation tool, this information is presented to the bilingual informant in the context field.

Generation itself is done using the same GUI used for feature structure generation. An optional window is used to gather the resulting sentences and save them in elicitation tool format.

4 References

Bouquiaux, Luc and J.M.C. Thomas. 1992. Studying and Describing Unwritten Languages. Dallas, TX: The Summer Institute of Linguistics.

Comrie, Bernard and N. Smith. 1977. Lingua descriptive series: Questionnaire. In: Lingua, 42:1-72.

Levin, L., R. Vega, J. Carbonell, R. Brown, A. Lavie, E. Canulef and C. Huenchullan. June 2002. “Data Collection and Language Technologies for Mapudungun”. In Proceedings of International Workshop on Resources and Tools in Field Linguistics at the Third International Conference on Language Resources and Evaluation (LREC2002), Las Palmas, Canary Islands, Spain.

Probst, K., R. Brown, J. Carbonell, A. Lavie, L. Levin, and E. Peterson. September 2001. “Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages”. In Proceedings of the MT-2010 Workshop at MT-Summit VIII, Santiago de Compostela, Spain.
    ((subj ((np-my-general-type pronoun-type common-noun-type)    ; “Multiply out by these lists of values”
            (np-my-person person-first person-second person-third)
            (np-my-number num-sg num-pl)
            (np-my-biological-gender bio-gender-male bio-gender-female)
            (np-my-function fn-predicatee)))
     ; Disjoint set of copula types and their predicates
     {[(predicate ((np-my-general-type common-noun-type)
                   (np-my-definiteness definiteness-minus)
                   (np-my-person person-third)
                   (np-my-function predicate)))
       (c-my-copula-type role)]
      [(predicate ((adj-my-general-type quality-type)))
       (c-my-copula-type attributive)]
      [(predicate ((np-my-general-type common-noun-type)
                   (np-my-person person-third)
                   (np-my-definiteness definiteness-plus)
                   (np-my-function predicate)))
       (c-my-copula-type identity)]}
     (c-my-secondary-type secondary-copula)
     (c-my-polarity #all)    ; “Use all values of polarity”
     (c-my-function fn-main-clause)
     (c-my-general-type declarative)
     (c-my-speech-act sp-act-state)
     (c-v-my-grammatical-aspect gram-aspect-neutral)
     (c-v-my-lexical-aspect state)
     (c-v-my-absolute-tense past present future)
     (c-v-my-phase-aspect durative))

Box 4: A multiply specification used to define a set of copula sentences with all combinations of tense and subject person, number and gender
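As a rough illustration of how a multiply such as the one in Box 4 might be expanded into individual feature structures, the following Python sketch treats the expansion as a Cartesian product over the listed values, with each disjoint set contributing one alternative at a time. It is not the implementation used by our corpus specification software. The function name expand_multiply, the ALL_VALUES table, the value name polarity-negative, and the flattened "subj." and "predicate." key prefixes are all assumptions made for this example, and only a subset of the features in Box 4 is included. The sketch also ignores the incompatibility rules of the feature specification, so it may produce combinations that the real system would exclude.

```python
from itertools import product

# Hypothetical inventory consulted when a feature is marked "#all".
# "polarity-negative" is an assumed value name; only "polarity-positive"
# appears in the sample feature structure.
ALL_VALUES = {
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
}


def expand_multiply(features, disjoint_sets=()):
    """Expand a multiply into a list of flat feature structures (dicts).

    features: dict mapping a feature name to a single value, a list of
        values, or the string "#all".
    disjoint_sets: iterable of disjoint sets; each disjoint set is a list
        of dicts, and the features inside one dict vary in tandem.
    """
    names, value_lists = [], []
    for name, spec in features.items():
        if spec == "#all":
            values = ALL_VALUES[name]      # every value known for the feature
        elif isinstance(spec, (list, tuple)):
            values = list(spec)            # explicit list of values
        else:
            values = [spec]                # one fixed value
        names.append(name)
        value_lists.append(values)

    structures = []
    for combo in product(*value_lists):
        base = dict(zip(names, combo))
        # Pick one alternative from every disjoint set.
        for bundle in product(*disjoint_sets):
            fs = dict(base)
            for alternative in bundle:
                fs.update(alternative)
            structures.append(fs)
    return structures


# A flattened subset of the multiply in Box 4.
copula_multiply = {
    "subj.np-my-general-type": ["pronoun-type", "common-noun-type"],
    "subj.np-my-person": ["person-first", "person-second", "person-third"],
    "subj.np-my-number": ["num-sg", "num-pl"],
    "subj.np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
    "c-my-polarity": "#all",
    "c-my-general-type": "declarative",
    "c-v-my-absolute-tense": ["past", "present", "future"],
}

# Disjoint set: each copula type is paired with its own kind of predicate.
copula_types = [
    {"c-my-copula-type": "role",
     "predicate.np-my-definiteness": "definiteness-minus"},
    {"c-my-copula-type": "attributive",
     "predicate.adj-my-general-type": "quality-type"},
    {"c-my-copula-type": "identity",
     "predicate.np-my-definiteness": "definiteness-plus"},
]

structures = expand_multiply(copula_multiply, [copula_types])
print(len(structures))  # 2 * 3 * 2 * 2 * 2 * 3 combinations, times 3 copula types = 432
```

Keeping the disjoint sets separate from the ordinary value lists is what lets the copula type and its predicate change together rather than being multiplied against each other.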