
Collaborative Research: Interlingual Annotation
of Multilingual Text Corpora
1. Introduction
We propose research that aims at providing a well-defined, motivated and practical semantic level of representation
that captures information from natural language text. We refer to this level of representation as an “interlingual
representation”. This research will provide the basis for a paradigmatic shift enabling corpus-based research as well
as linguistic research into language-independent meaning representations in areas of natural language processing
(NLP) such as machine translation, question answering and information retrieval. The novelty of the research comes
not only from the interlingua representation itself, but also from an improved methodology for designing and
evaluating such representations.
The proposed research has four aspects.

First, we propose to compile a collection of texts in six or seven non-English languages, each coupled with at
least three translations into English. The non-English languages that may be included in our investigation
are: (1) Arabic, (2) Chinese, (3) Spanish, (4) Persian, (5) Russian, (6) Japanese, and (7) French. These have
been chosen based on the availability of corpora and NLP tools at or available to sites that are participating in
this proposal.

Second, we propose an interlingual representation framework based on the careful study of these parallel text
corpora. The framework will include a formal definition of the representation language along with coding
manuals for the main components of meaning (e.g., event time, aspect, modality, etc.). A key property of the
representation framework will be meaning components that are richly designed, but also compatible with
underspecification.

Third, we will annotate these bilingual corpora using the agreed-upon interlingual representation. This effort
will also allow those corpora to be extended straightforwardly, without requiring further research.

Fourth, we will propose metrics for evaluating interlingual representations and for choosing a grain size of
meaning representation that is appropriate for a given task. The metrics are based on inter-coder reliability,
the growth rate of the interlingual representation, and the quality of the target-language text that can be
generated from the interlingua.
The impact of this research comes from two areas: the depth of annotation and the evaluation metrics that delimit the
annotation task. Together they enable research on both corpus-based methods and the modeling of
language-independent meaning. To date, such research has been impossible, since corpora are annotated in a shallow
manner, forcing researchers to choose between shallow approaches and hand-crafted approaches, each having its own
set of problems.
1.1. Scientific Merit
The scientific merit of this investigation lies in the definition of a level of semantic representation for natural language
text – the “interlingua representation” – which captures important aspects of the meaning of different natural
languages. This level of representation will be motivated from empirical work on corpora, and it will be defined in
such a way that it can be used in the practical annotation of further, large corpora. It will be associated with an
evaluation methodology which allows a researcher to determine the accuracy of an interlingua annotation for a given
text, and the grain size of meaning representation appropriate for a given task. To date, no such level of representation
has been defined, and no attempt at annotating corpora at such a level of representation has been made.
1.2. Broader Impact
The broader impact of this research lies in the critical mono- and multilingual resources it will provide, and in the
resources that the defined interlingua will enable to be created in the future. Our interlingual framework will initially
be shared by the project participants but will eventually be distributed freely to researchers in the
computational linguistics community as a whole. The resulting annotated, multilingual, parallel corpora will be useful
as an empirical basis for a wide variety of research including the development and evaluation of interlingual NLP
systems as well as a host of other research and development efforts in theoretical and applied linguistics, foreign
language pedagogy, translation studies, and other related disciplines.
We recognize the immense value of existing corpus annotation projects such as the Penn Treebank for English
(Marcus et al. 1993) and other syntactic treebanks distributed by the Linguistic Data Consortium, the Semeval data
(Moore 1994), and the PropBank from the University of Pennsylvania (Kingsbury and Palmer 2002) for progress in
computational linguistics. In particular, these corpora have allowed for the use of machine learning tools (including
stochastic methods), which have proved much better than hand-written rules at accounting for the vast empirical basis
provided by natural language. However, machine learning approaches have in the past been restricted to fairly
superficial phenomena. Our proposed effort will produce the first corpora of any kind annotated with a detailed
interlingua, i.e., deep semantic information. When completed, the results may be used for guiding the implementation and evaluation
of new or improved computational models of natural language processing or the development and evaluation of
cognitive models of such processes. The corpora could be used for carrying out research in any number of areas of
comparative linguistics, translation theory and language learning as well as, possibly, for training people in translation
or language learning. It could also be used for training others to annotate texts for further research on interlingual
representations.
The combination of deep semantic information in the interlingua and of large corpora for machine learning-based
approaches will provide a boost to NLP comparable to that provided by the first “shallow” corpora such as the Penn
Treebank. In addition, a number of by-products of the corpus preparation activities should prove valuable to the
research community (both computational linguistic and linguistic). For example, in the course of this project we will
refine our suites of NLP tools for each language that we investigate. These include tokenizers, named-entity
recognizers, normalizers for dates and temporal expressions, part-of-speech taggers, phrase-level segmenters,
clause-level segmenters and alignment tools for each of those languages.
Finally, this project should provide a useful environment for training future professionals in computational linguistics,
machine translation, linguistics and translation. A large portion of the personnel on this project will be involved in the
data preparation, analysis and annotation process—all of which provide practical, hands-on training in all four areas.
2. Objectives
The immediate objective of our effort is to develop the computational infrastructure for annotating large multilingual
parallel corpora with interlingual information in a consistent and reliable manner. Taking advantage of the
computational, linguistic and language expertise of the participants, each site will be in charge of one (or in some cases
perhaps two) languages and will compile and annotate corpora using a common interlingual representation language. Each
corpus will consist of a number of news articles in a source language along with multiple, independently generated
human translations into English.
Once the translations have been created, the work will consist of two principal tasks. First, the parallel corpora will be
segmented into translation units and aligned. The translations will then be compared, translation unit by translation
unit, for any differences that may provide clues to the aspects of interlingua that are relevant for translation.
Second, the corpora will be annotated for interlingual content and the results evaluated for accuracy and for
consistency between annotations and between annotators. Each translation phenomenon will systematically be
reviewed by project participants with the aim of establishing a standardized notation plus the corresponding treatment
of the phenomenon in each language. These phenomena include the typical perspectival and interpretational
differences one observes between languages with respect to the understanding of, for example, events, objects and
object groupings, time, epistemic status (e.g., hypotheticality, desired state, etc.) and so on. A basic premise of this
proposal is that it is only by systematically comparing these phenomena across several languages simultaneously, with
real texts and translations at hand, that one can develop adequately powerful interlingual notations for them.
We aim to improve NLP, and machine translation and multilingual language technologies in particular, through the
use of linguistically motivated levels of semantic representation. Representations of meaning (as opposed to more
surface-oriented components of language such as syntax) are often criticized when quick start-up is required (as in the
case of rapid deployment of MT for new languages). However, well-designed meaning representations offer enormous
potential gains in the quality of a system’s output as well as important long term scientific and technological advances
(Mitamura et al. 1991, Hovy 2003, Philpot et al. 2002, Ambite et al. 2002, Dorr and Olsen 1997, Olsen et al. 1998,
Habash and Dorr 2002, Waibel et al. 1997, Magerman 1995, Collins 1997). By providing criteria for evaluating the
reliability and coverage of interlingual representations, by providing an example of an interlingua that meets such
criteria, and thus by enabling the creation of corpora annotated with interlingual representations, this research will
significantly improve the quality of interlingua-based language technologies and significantly reduce their
development time. The resulting interlingual framework would be useful not only for supporting seamless machine-mediated
linguistic communication between people writing or speaking different languages but also for enabling virtually any
language-based information seeking activity (information retrieval, information extraction, question-answering,
information summarization, data mining, evidence extraction, the detection of significant relationships between
situations and events, and so on).
3. Background
This proposal concerns the creation and evaluation of interlingual representations for various multilingual
applications of language technologies such as machine translation (MT), cross-language information retrieval,
information extraction, question-answering, information summarization and evidence extraction. The main focus of
this research, however, is on interlingual representations for MT.
An interlingua is an intermediate language representation that can be used to mediate between source and target
languages in machine translation. The advantages of using an interlingua for translation are well known. First, because
each language has its own independent analyzer mapping it into the interlingua and generator mapping it out of the
interlingua, any number of source and target languages can be connected without having to write explicit rules for each
language pair and each direction. Thus, interlingual systems both save development time and reduce system size,
especially for bi-directional multilingual systems involving more than two languages. Second, an intermediate
language representation can provide a neutral basis of comparison for translation equivalents that differ syntactically.
In spite of these advantages, interlingual machine translation has not been widely used in comparison to
transfer-based machine translation or, more recently, systems based on statistical methods, which have been gaining
popularity in all areas of language technology. One reason for this situation is that there is no commonly accepted
theory of interlingua, and the problem is too big to address from scratch in the life span of a typical research or
development effort. However, we do not claim that a standard theory of interlingual representation would by itself
increase the popularity of interlingual representations. Indeed, a central problem that any standard theory will have to
account for is the fact that different aspects of interlingua are relevant for different applications of MT. But what is
both necessary and feasible in the near term is the development of a methodology for interlingual representations.
Guidelines for evaluating an interlingua would put boundaries on the problem and enable research projects to gain the
benefits of linguistic knowledge instead of being forced to abandon it.
Ideally, interlingual representations would have the following properties:
Inter-coder compatibility: Two researchers, faced with the same piece of text, should be able to annotate it with
compatible interlingual representations. Compatible encodings would not change meaning in a way that would
cause system failure, but are not necessarily identical. For example, a natural language generator taking two
compatible interlingua representations as input might produce the same output or two different but equally
acceptable outputs. Compatibility is especially important in multi-site development efforts where, for example, a
source language analyzer built in Italy might have to produce an interlingua that is compatible with a target
language generator built in Korea.
Granularity and coverage that are appropriate to the application: For any given application, it is not
necessary to represent every aspect of interlingua. An interlingual representation that is too deep will take a long
time to develop, may not meet the criterion of inter-coder compatibility, and may be difficult to produce reliably
with NLP software. Conversely, an interlingual representation that is not detailed enough will lose distinctions
that are necessary for the application. It is always necessary to strike a balance in order to build a running system.
Striking a balance does not mean sacrificing theoretical correctness. It can, for example, involve a detailed theory
that allows underspecification for non-critical details.
To be well specified and most useful, these properties of interlingua require three distinct but related enterprises:
1. The representational formalism; this involves issues such as whether or not the phenomenon should be
represented by a simple slot-filler pair or should instead have scope over a larger unit of representation, and where
the phenomenon typically fits in relation to other representational units.
2. The representation content (structures, terms and symbols); this involves issues such as whether or not the values
representing the phenomenon are discrete; if so, which symbols to use and, if not, how the continuum is
represented; how the values are determined; and what the relationship is between the symbols and lexical item
definitions.
3. Examples of representation, tied to actual text (which is, of course, most useful with examples in various texts, in
various languages).
Developing such representations and supporting knowledge bases is not trivial, especially because there is often no
obviously correct answer. To ensure the success of such an enterprise, we rely on two strengths unique to the team of
participants in this proposal: (1) Agreement on a clear methodology for arriving at decisions regarding the
interlingua, as indicated by the recent annotation experiment conducted for the workshop of SIG-IL (Special Interest
Group on Interlinguas) (Habash 2002). (2) Complementary research foci and synergy among the participants, as
exemplified by a solid history of successful cross-site collaborations on the Pangloss MT project (LTI, CRL, ISI)
(Farwell et al 1994), the Nitrogen natural language generator (ISI and UMIACS), the Mikrokosmos and Omega
ontologies (ISI and CRL), the 2002 Johns Hopkins Summer Workshop on Generation for Machine Translation
(UMIACS and Columbia), and the three workshops of the SIG-IL that have been held since 1998.
4. Proposed Program of Activities
The central activity of this research effort is to carry out the development of a commonly shared, empirically
motivated interlingual representation system based on a large comparative study of multiple translations (at least
three) of 100 non-English documents in each of six different source languages into the same target language (English).
The ultimate goal is to formulate guidelines on reliability and coverage of interlingua representations and thus to
delimit the task of interlingua design.
The following tasks are described in this section.
1. Collection and pre-processing of corpora
2. Delimiting the phenomena for annotation and designing the corpus markup language
3. Automated detection of dependency mismatches
4. Comparison of human translations
5. Coding by human annotators
6. Evaluation of the annotated corpus
7. Project evaluation
4.1. Collection and Pre-Processing of Corpora
As a preliminary step, prior to the comparative analysis and annotation tasks, the corpora of parallel texts will be
gathered. Each corpus will consist of a number of texts (100 to 125 per language, totaling 50,000 to 100,000 words
per language) from a given source language along with three independently prepared translations into English. This
amount has been chosen because much of the corpus material for Spanish, French, and Japanese is already in place,
having been compiled as part of the 1994 DARPA MT evaluation (White & O’Connell 1994). Each of these corpora
consists of 100 news articles along with two translations into English, prepared independently by different
translators. These will therefore require
only one additional translation of each article. Chinese and Arabic corpora with multiple reference translations have
been created for the DARPA TIDES program, and are available from LDC. For any remaining languages, such as
Persian and Russian, the news articles will have to be selected and each text will then have to be translated into English
by three different translators.
Next, to assist in the analysis and annotation tasks, various text processing tools will be borrowed and modified (or, if
necessary, developed) for NLP tasks including tokenization; recognition of named entities, temporal expressions,
monetary expressions, and other phrases requiring some kind of normalization; morphological analysis and
part-of-speech tagging; phrase boundary recognition; clause boundary recognition; sentence alignment, and word
alignment. Such tools exist already at the participating sites for the languages we have chosen. These will be applied
to automatically segment the data into clauses and phrases (considered the central units of translation) which will then
be aligned (source language unit followed by corresponding target language unit in each of the three translations).
After sentence level alignment of the source language and three human translations, we will have quadruples of
sentences such as the following:
Acumulación de víveres por anuncios sísmicos en Chile
Hoarding caused by earthquake predictions in Chile
Stockpiling of provisions because of predicted earthquakes in Chile
Signs of earthquakes cause stockpiling of provisions in Chile
After tokenization, part-of-speech tagging, and clause level chunking, the following bracketed structures might result:
[ [ [ Acumulación n] [ [ de p] [ [ víveres n] np] pp] np] [ [ por p] [ [ anuncios n] [ sísmicos adj] np] pp] [ [ en p] [
Chile pnp] pp] s]
[ [ [ Hoarding n] np] [ caused v] [ [ by p] [ [ earthquake n] [ predictions n] np] pp] [ [ in p] [ Chile pnp] pp] s]
[ [ [ Stockpiling n] [ [ of p] [ [ provisions n] np] pp] np] [ [ because of p] [ [ predicted psp] [ earthquakes n] np] pp]
[ [ in p] [ Chile pnp] pp] s]
[ [ [ Signs n] [ [ of p] [ [ earthquakes n] np] pp] np] [ cause v] [ [ Stockpiling n] [ [ of p] [ [ provisions n] np] pp] np]
[ [ in p] [ Chile pnp] pp] s]
4.2. Delimiting the Phenomena for Annotation and Designing the Corpus Markup Language
In parallel to corpus collection and pre-processing, the participating sites will come to an agreement on the interlingua
subsystems for which the corpora will be annotated (e.g., thematic roles, temporal relations between events or states,
reference types and coreference relations, rhetorical relations, modality, time and aspect, etc.) and on the procedure for
marking up the corpus.
We will choose approximately four phenomena for markup. Two initial candidates are events (identification of events
independently of whether or not they are expressed as verbs) and thematic roles. The participating sites have
conducted a pilot annotation experiment on thematic role markup on a monolingual corpus, which was the topic of a
workshop at the conference of the Association for Machine Translation in the Americas (AMTA) in October 2002. A
description of the annotation experiment can be found at:
http://www.umiacs.umd.edu/~habash/il-wkshp/il-wkshp.html.
In our experiment, at least one representative from each of our sites (eight annotators in all) was asked to assign
thematic roles to each node in twenty syntactic dependency parse trees from the Penn Treebank (averaging 25 words
in length). With one week of preparation, we agreed on a set of thematic roles (e.g., AGT (agent), THM (theme),
INS (instrument)) and achieved a cross-site inter-annotator agreement rate of 81% (Habash, 2002).
We can base our further work on the consensus that was reached at the workshop concerning thematic role definitions
and markup notation. The proposed work will, however, go further than the pilot experiment in considering bilingual
corpora. The examination of bilingual data will enable us to address research hypotheses concerning cross-linguistic
mismatches in thematic roles.
4.3. Automated Detection of Dependency Mismatches
Machine translation divergences are pairs of source- and target-language sentences that have the same meaning but
different syntax or dependency relations (Dorr, 1994). For example, the meaning of recent past expressed by the
English adverb just, as in I just did my homework, can be expressed in French by a main verb plus particle and
infinitive, venir de v-inf (come from v-inf). With a parallel, aligned corpus, the differences between source and target
translation units, both in terms of lexical forms and word or constituent order, can be easily identified and classified.
We propose to use automatic divergence annotation techniques to tag each source language/translation pair with a
divergence type. These will ultimately be made available to the community for the purpose of cross-linguistic research
and system development, e.g., for DARPA’s translingual effort in TIDES and follow-ons. Our approach will involve
the application of DUSTer (Dorr et al. 2002), the University of Maryland's automatic annotation system, to each source
language-translation pair in our corpora; these will subsequently be reviewed by hand for accuracy.
The DUSTer automated divergence annotation system can identify divergences such as a noun-modifier swap
between the Spanish [[anuncios n] [sísmicos adj] np] and English [[predicted psp] [earthquakes n] np]. Such cases
would be annotated with one of 35 pre-defined divergence types associated with head swapping (divergences in
which the concept corresponding to the syntactic head in one language does not correspond to the syntactic head in the
other language). The resultant annotation is:
...[<DIV:6.FVar2B> [anuncios n] [sísmicos adj] </DIV:6.FVar2B> np]...
…[<DIV:6.FVar2B> [predicted psp] [earthquakes n] </DIV:6.FVar2B> np]…
where the DUSTer divergence rule associated with the annotation 6.FVar2B has a left-hand side that matches the
Spanish structure, and a right-hand side that matches the English structure:
6.FVar2B: [[W1 n] [W2 mod]] > [[W1 mod] [W2 n]]
Annotation of these structures in this way allows us to infer new classes of words associated with certain divergence
types, and it provides a means for improving the performance of alignment for statistical processes later on.
4.4. Comparison of Human Translations
After the corpora are prepared, we will examine differences between the translations of the same corpus. Differences
in human translations may give us clues to which aspects of interlingua are important for translation. Variations
between human translators can fall into three categories:
1. Translator errors
2. Meaningful alternatives due to differences in the translators’ beliefs about what is being said, how it is being
said or why it is being said
3. Non-meaning-bearing alternatives (free variants)
For the text segment Acumulación de víveres por anuncios sísmicos en Chile, there are a number of lexical and
syntactic variations in the three human translations. More importantly, of these variations, none are due to translator
error; some are free variants (because of, caused by, and cause expressing the causation relation); and some may
indicate differences in the translators’ beliefs, such as the portrayal of accumulation as antisocial hoarding or prudent
stockpiling.
Since sentences that are equivalent in meaning for a given application can (but need not) have identical
interlingua representations, we are interested in the non-meaning-bearing translation alternatives. These will allow us
to formulate hypotheses about syntactic and lexical differences that could be neutralized in the interlingua.
We will also pay close attention to meaningful alternatives in human translations as they may point to ambiguities or
vagueness in the source text. These will allow us to formulate hypotheses about appropriate granularity of meaning
representation in the interlingua. We will also be able to formulate hypotheses about which elements of meaning are
inherent in the source text and which are open to interpretation by the reader, thus contributing to our goal of
delimiting the seemingly open-ended task of interlingua design.
4.5. Coding by Human Annotators
The next step is to annotate all three texts with respect to some aspect of interlingua, initially event and object
representation. Suppose that in this case the task is to identify the events referred to or implied along with their
associated thematic structure. The annotators would posit three central events (“amassing of provisions,” “predicting
of earthquakes,” and “an earthquake”) and one state-of-affairs (the “amassing of provisions” is causally related to the
“predicting of earthquakes”). In addition, the annotator would indicate that an amassing event has an agent, implicit at
this point, and a theme, the provisions; that a predicting event has an agent, implicit at this point, and a theme, the
earthquake event; and that the future earthquake has a location, broadly speaking Chile. The amassing is the caused
event, and the predicting is the causing event.
To assist the annotation process both from the point of view of efficiency and from the point of view of consistency, an
annotator’s interface will be developed, modified or extended to support the incipient mark-up activity. The interface
will be an early priority, with regular testing and improvements as requested by the participants.
4.6. Evaluation of the Annotated Corpus and of the Annotation Scheme
We will perform several types of evaluation throughout the project.
The annotated corpora will be evaluated for the accuracy of the coding and inter-annotator agreement, using the
usual measures such as kappa (Carletta 1996). Even for the simple example above, it would not be surprising to find
variations. For instance, while one annotator might view predictions as a cause of the amassing, another might view an
earthquake, albeit only a possible earthquake, as the cause. In any case, such differences will come to light during the
evaluation phase.
Sometimes, such differences may require changes to the notation or to the interlingua symbol(s) representing the
phenomenon in question. Having at hand examples of legitimately different interpretations, and corresponding
suggestions for representing them, will facilitate the development of a robust and powerful interlingua. The fact that
this work will be carried out not at one location, and not by one similarly trained team, is one of the novel aspects of
the proposed work. Few if any other interlingua-construction projects have had this distributed nature.
Evaluation of inter-coder reliability implies that at least some parts of the corpora must be annotated by two or more
annotators. Since the corpora will involve several source languages, and not all of the annotators will know all of the
languages, we will construct a composite English corpus for the purpose of checking inter-coder compatibility. The
composite corpus will have texts from each of the parallel corpora.
In summary, the underlying assumption of the proposed research effort is that a comparative analysis of multiple
translations of a text into the same language provides the soundest empirical basis for formulating a shared interlingua
representation system and for annotating corpora for interlingua content.
We propose to develop an additional metric for evaluation of interlinguas based on growth charts. A growth chart is
a graph of interlingua growth as a function of how much data has been annotated. We have found growth charts to be
diagnostic of strong and weak points of interlingua design and also to be an estimator of the complexity of one domain
in comparison to another (Levin et al., 2002).
The interlingua has a formal definition including the syntax of the interlingua, concept names, slot names, and slot
values. After the formal definition has been established, the annotators work through the parallel corpora, possibly
finding it necessary to add to or change the interlingua definition. The number of additions and changes to the
interlingua definition will be plotted as a function of the amount of data that has been annotated. (Selecting sentences
in random order and using cross-validation are useful in case some parts of the corpus are more complex than others.)
Growth charts can be used to track coverage of the interlingua and to detect problems in granularity of meanings in the
interlingua. If the plot has a steep slope with no sign of leveling off, it is clear that interlingua development is not
complete, possibly due to a level of granularity that is too fine. If this is the case, then the development cycle should be
reinitiated, with a coarser degree of granularity, until the curve levels off in subsequent coverage evaluations. In order
to facilitate the reduction of granularity we will design our interlingua to be semantically rich, but compatible with
underspecification.
We will perform an extrinsic evaluation of our interlingual annotations by using them as input to natural language
generators, and evaluating the output of the generator as if it were the output of a machine translation system. We will
produce target language output from our interlingua using five generation systems: GHMT (Habash and Dorr, 2002);
Halogen (Langkilde 2000); the KANT generator (Mitamura et al., 1991); FUF-SURGE (Elhadad and Robin 1992);
and FERGUS (Bangalore et al 2001). On the face of it, the results of evaluating the output of these systems will tell us
about the quality of both the input and the generators. However, by using a large number of generators, we can
distinguish between effects due to the generator and effects due to the input representation: if all or most generators
show an improvement from one input to another for the same target sentence, then we can conclude that the effect is
due to the input representation and not to a generator coincidentally having trouble with the original input representation.
To evaluate the output of the generators, we will follow a two-pronged approach, paralleling the goals of the recent
LREC-2002 MT evaluation framework presented at the workshop entitled “Human Evaluators Meet Automated
Metrics” (http://www.issco.unige.ch/projects/isle/mteval-may02/mteval-lrec2002.pdf). The two types of evaluation
are: (1) automatic evaluation techniques; and (2) quality judgments by humans. The utility of automatic measures is
clear: they provide cheap, quick, repeatable, and objective evaluation. However, since human judges are the final
reference in MT evaluation, the results of automated metrics must correlate well with (some aspect of) human-based
evaluation.
An automated approach to extrinsic evaluation of our framework will be undertaken using the Bleu technique
developed at IBM (Papineni et al., 2001; Papineni, 2002; Papineni et al., 2002), among others. The metric was adapted
for the recent NIST MT evaluation (Doddington, 2002). The principle of this metric, which is fully implemented,
is to compute a distance between the candidate translation and a corpus of human “reference” translations of the
source text. The distance is computed by averaging n-gram similarity between texts, for n = 1, 2, 3, 4 (higher values do
not seem relevant). That is, if the bi-grams (pairs of consecutive words) and tri-grams of the candidate text are
close to one or more of those in the reference translations, then the candidate scores high on the BLEU metric.
Comparison of the results of this technique with human judgments on the same texts indicates that there is a
correlation between human scores and Bleu scores (Papineni et al., 2001; 2002). Other automated evaluation
techniques include MITRE’s NEE (Named Entity Evaluator), which compares MUC-style named entities in candidate
and reference translations.
The compilation of three sets of English references for 125 texts in each language provides an adequate basis for such
an evaluation. Moreover, our plan for broad distribution of these multiple references will provide an ideal testbed for
other NLP researchers who use Bleu and other automatic scoring methods—the reference translations are immediately
reusable any time changes are made to an existing MT system (or an interlingual representation underlying such a
system). This adds to the significance and usefulness of the corpora.
We will also use the judgment-based measure of clarity (Vanni & Miller 2002), which merges the standard MT
metrics of comprehensibility, readability, style, and clarity into a single evaluation feature. The primary question
asked of the human judge is whether the sentence is immediately clear—akin to the question “Do you get it?”
Since the feature of interest is clarity and not fidelity, it is sufficient that the sentence express some clear meaning;
that meaning need not reflect the meaning of the input text. Thus, no reference to the source text or a reference
translation is required. This is an important benefit of the approach, in contrast to the automated Bleu
technique, where literally thousands of human reference translations are required for acceptable confidence levels
(Papineni, personal communication). Note that the sentence need neither make sense in the context of the rest of the
text, nor be grammatically well-formed; thus, the clarity score for a sentence is basically a snap judgment of the degree
to which some discernible meaning is conveyed by that sentence.
Another crucial advantage of this technique is that it has been shown to correlate, surprisingly, with the metric of
“fidelity.” Thus, the results of applying this metric mimic the results of judging closeness in meaning to the original
source-language text, without requiring bilingual expertise on the part of the human judge.
5. Relation to PIs’ long-term goals and other work in progress
Each of the participating sites (NMSU, UMD, MITRE, CMU, ISI and Columbia) has extensive experience in
interlingual approaches to MT as well as in the use of interlinguas for other language technologies. However, each site
has focused on different aspects of representing the meaning of texts. The following paragraphs describe the past and
present projects of each research site in relation to the proposed research.
New Mexico State University. For the research team at NMSU’s Computing Research Laboratory (CRL), this
research project represents the first stage of a three-stage research program into pragmatics-based MT. The larger,
longer-term effort additionally includes assembling the computational infrastructure for developing pragmatics-based
NLP (and specifically MT) systems, developing a methodology for evaluating such systems, implementing one or
more prototype, limited-domain, pragmatics-based MT systems, and evaluating the performance of these prototype
systems. This work grows out of a series of collaborations over the last ten years which have been aimed at developing
the broad outline of a pragmatics-based approach to MT and a methodology for developing and testing
pragmatics-based NLP systems. For the CRL group, the annotated multilingual corpus represents an important
empirical basis for developing and testing a pragmatic inferencing mechanism, and a standardized interlingua allows
any future results that may come out of our efforts to be used by other research groups.
The proposed research program is in a symbiotic relationship with a number of other current and recent research
projects at the CRL. The lab’s general approach to the full range of NLP applications has stressed knowledge-based
approaches which exploit a common interlingua (Text Meaning Representation or TMR) for the purpose of
representing and reasoning about the information communicated through text. These efforts include Mikrokosmos
(1994-1998), a knowledge-based interlingual approach to Machine Translation which uses TMR as the pivot between
analysis and generation. Keizai (1997-2000) is a cross-language information retrieval system which accepts queries in
various languages and seeks relevant documents in a multilingual database of texts. The key strategy is to convert
the query terms to TMR concepts. The user then selects among the concepts and the results are used to generate key
words in the different languages that serve as a basis for retrieval. The CRL’s approach to question-answering
(2000-present) also relies on converting both query and text to TMR. Initially a knowledge base is constructed by
converting information conveyed by relevant texts in different languages into the interlingua. The query is then
converted into interlingua and that structure is used to extract a responsive interlingual structure from the knowledge
base which, in turn, is used for formulating the answer in the language of the query.
Not only do these efforts stand to be extended and improved by the proposed research but, if the resultant interlingua is
sufficiently similar to TMR, the different systems described above might be more readily accessible to others in the
NLP community. More importantly, both the TMR and the experience gained in developing and implementing
interlinguas as a result of the above research efforts should be very beneficial to the proposed interlingua development
effort.
The MITRE Corporation continues to have efforts in machine translation, with a focus on low-density languages.
The Quick-MT project examined dictionary extraction for building MT lexicons and also for grammar learning
through exemplars (Miller & Zajic 1998; Zajic & Miller 1998). Additionally, the resulting systems were incorporated
into the MITRE prototype CyberTrans, which has since been transitioned to an operational system and continues to
serve as a model for integrating multiple disparate translation engines (Miller et al. 2001). Currently, MITRE is
working in collaboration with the University of Maryland on Transforms. Transforms combines optical character
recognition (OCR) techniques in a platform with MT, and allows component-level and system-level evaluation as well
as investigation of the impact of component-level improvements on system-level performance. Additionally, the
MITRE-sponsored research program, Foreign Language Tool Improvement Through Evaluation (FLITE), is looking at
evaluation methodologies for MT, combining these with learning processes, and improving the natural language
generation aspect of MT. Finally, MITRE has been a driving force in the ISLE-NSF machine translation evaluation
effort (Hovy & Reeder 2001; Vanni & Miller 2001).
In addition to work specifically in machine translation, we have integrated multiple foreign language processing tools,
such as named-entity taggers (Aberdeen et al., 1996; Aberdeen et al., 1995; Vilain, 1999; Vilain & Day, 1996). The
DARPA-funded TIDES work showed large-scale integration of research systems and an exploration of their
interdependencies in the MiTAP system (Damianos et al., 2002a; Damianos et al., 2002b). A related area of ongoing
research is that of temporal and geographic name normalization (Mani & Wilson, 2000; Ferro, 2001; Ferro et al.
2001), in which the team participated in the definition of a tagging standard along with the requisite tools to process
the data. Another integration of MT is the Translating Instant Messenger (TrIM) prototype (Miller et al, 2001;
Condon & Miller, 2002a; Condon & Miller, 2002b). Finally, MITRE is active in the area of summarization (Mani &
Bloedorn, 1999).
University of Maryland, College Park. The interlingual team at the University of Maryland has produced
annotations as a part of their Divergence Unraveling for Statistical Translation (DUSTer) effort (Dorr et al 2002), in a
large DARPA/ONR-funded Multi-University Research Initiative. DUSTer researchers are focused on enabling more
accurate language-to-language alignment and projection of English dependency trees to a foreign language. These
annotations are intended to resolve some of the most prevalent linguistic divergence cases by specifying what would
be required to transform the sentence structure of one language to bear a closer resemblance to that of the other
language. This effort is a descendant of earlier NSF-funded work in which a paradigm based on Lexical Conceptual
Structure (LCS) was developed for representing predicate-argument structures and their associated conceptual units.
The University of Maryland is currently developing automatic divergence annotation techniques based on the
following principles:
• every language pair has translation divergences that are easy to recognize;
• knowing what they are and how to accommodate them provides the basis for refined word-level alignment;
• refined word-level alignment results in improved projection of structural information from English to the foreign
language.
A divergence occurs when the underlying concepts or gist of a sentence is distributed over different words in different
languages. For example, the notion of running into the room is expressed as “run into the room” in English and
“move-in the room running” (entrar el cuarto corriendo) in Spanish. While seemingly transparent for human readers,
this poses problems for statistical aligners. Finding a way to deal effectively with these divergences and repair them
would be a massive advance for bilingual alignment and projection of dependency trees, e.g., for training of
foreign-language parser/translation systems.
Columbia University. Columbia has a long record of research in natural language generation (NLG) and related
areas, such as multimedia information presentation and summarization. NLG usually starts from a non-linguistic level
of meaning representation, and part of the task of research in NLG is to bridge the gap between domain meaning
represented nonlinguistically, and constructs of the target language. The language-independence of the input
representation is particularly clear in multimedia generation, where the same initial representation is used by linguistic
and graphical components (McKeown et al. 1998). Thus, researchers in NLG naturally deal with issues related to
interlingua (though the term is not normally used in NLG).
Recently, there has been interest in corpus-based methods, which are difficult in NLG because of the lack of corpora
annotated with the kind of representations from which NLG usually starts. Relevant work includes extracting semantic,
lexical and translingual information from unannotated corpora (Hatzivassiloglou and McKeown 1997, Fung and McKeown
1997, Barzilay and McKeown 2001) and training generators on annotated corpora using a variety of machine learning
approaches (for example Bangalore et al 2001, Duboue and McKeown 2001, Kan and McKeown 2002, Walker et al
2001). Clearly, such work could be much extended if interlingua-annotated corpora were available, but the current
research efforts do not support the creation of the necessary resources. Columbia personnel also have experience in
directing annotation projects.
Carnegie Mellon University. CMU's Language Technologies Institute has
pursued two types of interlingua design for the KANTOO and JANUS systems.
The KANTOO project (Nyberg and Mitamura, 2000) focuses on high quality translation of technical texts using an
interlingua that is based on predicate-argument structures. The KANTOO interlingua representation is designed for
multi-lingual generation, and has been applied to Spanish, French, German, Italian and Portuguese.
The Janus speech-to-speech translation systems have given us experience in three areas related to the proposed
research: interlingua design, evaluation of interlinguas, and creation of a tagged interlingua database.
The Janus research efforts (Enthusiast (Lavie et al., 1997; Gates et al., 1997; Qu et al., 1997), C-STAR (Levin et al.,
2000), and NESPOLE (Lavie et al., 2002)) have resulted in an interlingua based on speaker intention rather than
literal meaning, designed for spoken language translation systems. Spoken task-oriented language contains many formulaic
expressions that are not translated literally. This has led us to a view of translation divergences based on their function
or meaning. We have found that divergences occur with speech acts such as greeting and requesting, and with modal
and aspectual meanings such as obligation, certainty, evidentiality, disposition, iteration, and habituality. In analysis
and generation, we therefore take a construction-based approach (Fillmore and Kay, 1993) to these types of meanings,
and our interlingua represents these concepts in a way that is independent of their syntactic expression (as main verbs,
auxiliary verbs, affixes, adverbials, etc.) in the source and target languages. In the course of this proposed research we
would like to continue to identify the types of meanings that are associated with translation divergences, and also
study the types of syntactic constructions (formulaic or compositional) that are associated with those meanings.
Because the C-STAR and NESPOLE projects are collaborative international projects (C-STAR has seven partners,
and NESPOLE has four), considerable effort has gone into designing an interlingua that is expressive enough to
provide accurate translations, flexible enough to port to new semantic domains or scale up to larger domains,
but at the same time simple enough to be used reliably by a diverse set of system developers who may never meet
each other. We have therefore developed evaluation metrics for expressiveness, scalability, portability, and cross-site
reliability (Levin et al., 2000, 2002). These metrics can be used as a starting point for the research proposed here,
although they will have to be refined and reformulated to apply generally to any interlingua.
The C-STAR and NESPOLE databases contain dialogues in the semantic domains of travel planning and medical
emergencies (chest pains and digestive problems). Dialogues were recorded and transcribed in English, German,
Italian, Japanese, and Korean. Some of the non-English dialogues have been translated into English, and some
remain monolingual in the database. Each utterance is broken into interlingua segments that roughly correspond to
sentences, and each segment is tagged with an interlingua representation. There are around 10,000 tagged segments
(sentences), which are used for interlingua and grammar development as well as for training of statistical methods
(Langley 2002). Intercoder agreement experiments are described in (Levin et al. 2002).
University of Southern California. The Natural Language Group at the Information Sciences Institute of the
University of Southern California has performed research in MT and multilingual text processing for over a decade.
Either in collaboration with others or on their own, ISI researchers have built interlingual systems such as Pangloss
(Farwell et al 1994; Spanish to English; with CMU and NMSU) and Gazelle (1996; Japanese to English), statistically
trained systems such as Rewrite (Al-Onaizan et al 2000; ongoing; Arabic, Tetun, and later Chinese to English), and
shallow systems such as QuTE (Lin and Hovy 1999; Bahasa Indonesia to English). At ISI our long-term plan is to find
the optimal mixture(s) of statistical and symbolic/manual methods of creating the resources and transformation rules
required for MT. It has been a long-standing goal to apply some of the statistical learning techniques that provide wide
coverage but often somewhat lower quality or restricted performance (only short sentences, or inadequate
pronominalization or proper name rendering, etc.) to an interlingually annotated translation pair, so that one can start
overcoming the quality/limitation bottlenecks while maintaining the robustness so hard to achieve in purely manual
approaches. Should the proposed annotated corpus be created, therefore, we will eagerly apply our latest MT learning
techniques to it.
Current work on MT at ISI focuses primarily on the development of statistical learning techniques that support a variety
of specific MT subtasks: proper name transliteration and translation, phrase unit recognition and translation, etc. A
small additional project (Hovy et al 2003), currently drawing to a close, focuses on the creation of a website that
organizes the complexity of MT evaluation measures (historically a rich, complex, and bewildering field of its
own) into taxonomies that allow potential MT system evaluators to decide more easily what they should measure for
their particular circumstances and how they should measure it (see http://www.isi.edu/natural-language/mteval/).
Slightly broader than MT, but relating directly to interlingua, the semi-automated ontology construction research at
ISI is developing a large new ontology called Omega and a suite of tools for aligning terms into ontologies, extracting
terms from text, discovering cross-relationships between terms, and mining ontological information from websites,
dictionaries, and other text.
6. Relation to present state of knowledge in the field and work in progress elsewhere
There are three central areas of activity to which the proposed research effort relates:
• Interlingua development: Text Meaning Representation (TMR), Interchange Format (IF), Lexical Conceptual
Structure (LCS), Sentence Planning Language (SPL) and AMR, Universal Networking Language (UNL)
• Data annotation and semantic networks: Penn Treebank, PropBank, FrameBank, Levin classes, Omega, Ontos.
• Other MT approaches: direct (SYSTRAN), transfer (Metal), example-based (Japan, UMIST, CMU), stochastic
(IBM, ISI, Germany); FAHQMT (and crummy MT), HAMT (CMU-controlled languages, CRL-Mikrokosmos),
MAHT (RALI, Trados).
Several centers for natural language processing and language technology development have or have had Machine
Translation projects which have followed an interlingual approach or involved the development of interlinguas. These
include Text Meaning Representation at the CRL and the University of Maryland Baltimore County, Lexical Conceptual
Structure at the University of Maryland, Interchange Format at the LTI at CMU, Penman’s Sentence Planning Language (SPL) and its
derivative AMR at ISI, and the Universal Networking Language (UNL) (http://www.unl.ias.unu.edu/) at several
centers around the world. The proposed research effort includes four of these groups and, therefore, is expected to
have a major impact on these efforts. It is unlikely that the resultant interlingua will have all of the features of all of
these interlinguas, but it will clearly be informed by all of these efforts, and it is expected to have a good deal of overlap
with existing interlinguas. The primary contribution of the proposed effort to existing interlinguas is, on the one hand,
to serve as a vehicle for unifying or standardizing them and, on the other, to provide an evaluation methodology and
corpus for testing the coverage and accuracy of interlingual systems.
In regard to the various other data collection and annotation efforts, including those related to the construction of
ontologies or semantic nets, the proposed research program should, in general, fold in with them effectively. Virtually
every semantically-oriented data collection and annotation effort focuses on some aspect of interlingua, whether that
is conceptual structure (or word meaning or ontology construction), state or activity classification (or verb
subcategorization or semantic classes), thematic roles (or verb case frames or valency), or propositional structure
(predicate-argument structure).
Finally, the proposed research is clearly related to other on-going efforts in MT, even ones that do not involve
interlinguas. An interlingual approach to MT design and development has traditionally been juxtaposed to the other
basic strategies for achieving fully-automated high quality MT: direct approaches as exemplified by most currently
available operational systems such as SYSTRAN in its early versions, transfer approaches as exemplified by the
remaining operational systems such as Metal and SYSTRAN in its more advanced versions, and example-based
approaches. With respect to all these efforts, the proposed research will be of interest and use to those systems (such as
direct or transfer-based systems) which can and do make use of semantic information where possible. As for the other
major classification of MT systems, i.e., the distinction between fully-automatic MT, human-assisted MT (HAMT,
e.g., CMU-controlled languages, CRL-Pangloss) or machine assisted human translation (MAHT, e.g., RALI, Trados),
the proposed research mainly offers an evaluation corpus of parallel text against which to test systems with each new
version.
Just as research in the various representational and lexical phenomena will inform our work, our work should inform
these other efforts. The central difference between assigning, say, propositional content to a text in one language and
propositional content within an interlingua markup is that, while the former needs to account for presuppositions,
entailments and default inferences, the latter must also account for translation equivalence relations and
translation divergences. That is to say, interlingua markup needs to account for multilingual relationships as
well as monolingual relationships.
7. Management Plan
The research efforts will include corpus development, tool development, comparative analysis of translations,
interlingua specification, corpus annotation, development of an evaluation methodology and evaluation. The proposed
length of the project is 36 months. All of the efforts will be directed by the project PIs. The Gantt chart (following
page) describes the overall project plan and schedule.
Corpus development. For Spanish, French and Japanese, a third translation will be produced for each of the existing
125 texts in each source language. Chinese and Arabic corpora with multiple English translations will be obtained
from LDC. For all other languages, a source language corpus of 125 news articles will be compiled and then translated
into English by three independent translators (5 weeks effort).
Toolkits. This task will involve preparation or development of tools for each language, including a tokenizer, sentence
boundary detector, named-entity recognizer, part-of-speech tagger, phrase and clause recognizers, alignment tools,
and interface tools (6 months effort).
Preparing the corpora. This involves applying the various tools for automatically marking up, segmenting and
aligning the texts followed by any hand correction needed for all the texts and translations in each corpus (1 month
effort).
Workshops. The research partners will hold four 2- to 4-day workshops, with open participation, at which
a proposal for the interlingua content for annotation will be presented, discussed, modified and adopted by the project
participants. The workshops will also focus on the annotation methodology as well as the evaluation methodology (1
month preparation for each workshop).
Three cycles of annotation and evaluation of corpora. First, a comparative analysis of each source language text and its
three translations will be carried out to identify and categorize each translation variation (3 months effort for each
language). The annotation effort involves developing a common interface for annotating multilingual parallel corpora
for interlingua content (6 months’ effort). In each cycle, the annotated corpora will be evaluated, and results will be
compiled and reported (1 month effort). The toolkits and annotation interface will be revised and updated as necessary
(1-2 months’ effort in each cycle). The cycle of annotation and evaluation will be repeated three times. This cyclical
planning-annotation-evaluation process will ensure the development of a resource that is as consistent as possible.
Furthermore, the openness of the planning and evaluation phases ensures that the resultant corpora will be usable by
the largest number of groups possible. A final 2-day workshop will be held before or during month 35, which would
focus on a final critical review of the evaluation methodology, the interlingua and the annotated corpus.
Documentation. Each corpus, tag set, toolkit, comparative analysis, interlingua subsystem, annotated text, evaluation
methodology and evaluation result will be documented and disseminated in a written publication or report. A final
project report will also be prepared.
Plan for documentation and sharing of research products
All data, both raw data and annotated data, will be placed in the public domain and made accessible via the internet to
any interested organization. In addition, the data will be made available through the Linguistic Data Consortium.
The tools used for this research project will also be made available to the research community and other interested
organizations via the project website from which they can be downloaded.
Reports, both quantitative and qualitative, on the results of the comparative study for each source-target language
corpus and for all corpora combined will be prepared and presented at major scientific meetings (ACL, COLING)
and workshops (AMTA IL workshop, Stanford AI Spring series, ARDA Northeast Regional Workshop). In addition,
all preliminary results will be published as part of the CRL’s Memoranda in Computer and Cognitive Science series.