
Collaborative Research: Interlingual Annotation
of Multilingual Text Corpora
1. Introduction
We propose research that aims at providing a well-defined, motivated and practical semantic level of representation
that captures information from natural language text. We refer to this level of representation as an “interlingual
representation”. This research will provide the basis for a paradigmatic shift enabling corpus-based research as well
as linguistic research into language-independent meaning representations in areas of natural language processing
(NLP) such as machine translation, question answering and information retrieval. The novelty of the research comes
not only from the interlingua representation itself, but also from an improved methodology for designing and
evaluating such representations.
The proposed research has four aspects.

First, we propose to compile a collection of texts in six or seven non-English languages, each coupled with at
least three translations into English. The non-English languages that may be included in our investigation
are: (1) Arabic, (2) Chinese, (3) Spanish, (4) Persian, (5) Russian, (6) Japanese, and (7) French. These have
been chosen based on the availability of corpora and NLP tools at or available to sites that are participating in
this proposal.

Second, we propose an interlingual representation framework based on the careful study of these parallel text
corpora. The framework will include a formal definition of the representation language along with coding
manuals for the main components of meaning (e.g., event time, aspect, modality, etc.). A key property of the
representation framework will be meaning components that are richly designed, but also compatible with
underspecification.

Third, we will annotate these bilingual corpora using the agreed-upon interlingual representation. This effort
will also allow those corpora to be extended straightforwardly, without requiring further research.

Fourth, we will propose metrics for evaluating interlingual representations and for choosing a grain size of
meaning representation that is appropriate for a given task. The metrics are based on inter-coder reliability,
the growth rate of the interlingual representation, and the quality of the target-language text that can be
generated from the interlingua.
The impact of this research comes from two areas: the depth of annotation and the evaluation metrics that delimit the
annotation task. Together they enable research on both corpus-based methods and the modeling of
language-independent meaning. To date, such research has been impossible, since corpora are annotated in a shallow
manner, forcing researchers to choose between shallow approaches and hand-crafted approaches, each having its own
set of problems.
1.1. Scientific Merit
The scientific merit of this investigation lies in the definition of a level of semantic representation for natural language
text – the “interlingua representation” – which captures important aspects of the meaning of different natural
languages. This level of representation will be motivated from empirical work on corpora, and it will be defined in
such a way that it can be used in the practical annotation of further, large corpora. It will be associated with an
evaluation methodology which allows a researcher to determine the accuracy of an interlingua annotation for a given
text, and the grain size of meaning representation appropriate for a given task. To date, no such level of representation
has been defined, and no attempt at annotating corpora at such a level of representation has been made.
1.2. Broader Impact
The broader impact of this research lies in the critical mono- and multilingual resources it will provide, and in the
resources that the defined interlingua will enable to be created in the future. Our interlingual framework will initially
be shared by the project participants but will eventually be distributed freely to researchers in the
computational linguistics community as a whole. The resulting annotated, multilingual, parallel corpora will be useful
as an empirical basis for a wide variety of research including the development and evaluation of interlingual NLP
systems as well as a host of other research and development efforts in theoretical and applied linguistics, foreign
language pedagogy, translation studies, and other related disciplines.
We recognize the immense value of existing corpus annotation projects such as the Penn Treebank for English
(Marcus et al. 1993) and other syntactic treebanks distributed by the Linguistic Data Consortium, the Semeval data
(Moore 1994), and the PropBank from the University of Pennsylvania (Kingsbury and Palmer 2002) for progress in
computational linguistics. In particular, these corpora have allowed for the use of machine learning tools (including
stochastic methods), which have proved much better than hand-written rules at accounting for the vast empirical basis
provided by natural language. However, machine learning approaches have in the past been restricted to fairly
superficial phenomena. Our proposed effort will produce the first corpora of any kind annotated with a detailed
interlingua, i.e., deep semantic information. When completed, the results may be used for guiding the implementation and evaluation
of new or improved computational models of natural language processing or the development and evaluation of
cognitive models of such processes. The corpora could be used for carrying out research in any number of areas of
comparative linguistics, translation theory and language learning as well as, possibly, for training people in translation
or language learning. It could also be used for training others to annotate texts for further research on interlingual
representations.
The combination of deep semantic information in the interlingua and of large corpora for machine learning-based
approaches will provide a boost to NLP comparable to that provided by the first “shallow” corpora such as the Penn
Treebank. In addition, a number of by-products of the corpus preparation activities should prove valuable to the
research community (both computational linguistic and linguistic). For example, in the course of this project we will
refine our suites of NLP tools for each language that we investigate. These include tokenizers, named-entity
recognizers, normalizers for dates and temporal expressions, part-of-speech taggers, phrase-level segmenters,
clause-level segmenters and alignment tools for each of those languages.
Finally, this project should provide a useful environment for training future professionals in computational linguistics,
machine translation, linguistics and translation. A large portion of the personnel on this project will be involved in the
data preparation, analysis and annotation process—all of which provide practical, hands-on training in all four areas.
2. Objectives
The immediate objective of our effort is to develop the computational infrastructure for annotating large multilingual
parallel corpora with interlingual information in a consistent and reliable manner. Taking advantage of the
computational, linguistic and language expertise of the participants, each site will be in charge of one (or in some cases
perhaps two) languages and will compile and annotate corpora using a common interlingual representation language. Each
corpus will consist of a number of news articles in a source language along with multiple, independently generated
human translations into English.
Once the translations have been created, the work will consist of two principal tasks. First, the parallel corpora will be
segmented into translation units and aligned. The translations will then be compared, translation unit by translation
unit, for any differences that may provide clues to the aspects of interlingua that are relevant for translation.
Second, the corpora will be annotated for interlingual content and the results evaluated for accuracy and for
consistency between annotations and between annotators. Each translation phenomenon will systematically be
reviewed by project participants with the aim of establishing a standardized notation plus the corresponding treatment
of the phenomenon in each language. These phenomena include the typical perspectival and interpretational
differences one observes between languages with respect to the understanding of, for example, events, objects and
object groupings, time, epistemic status (e.g., hypotheticality, desired state, etc.) and so on. A basic premise of this
proposal is that it is only by systematically comparing these phenomena across several languages simultaneously, with
real texts and translations at hand, that one can develop adequately powerful interlingual notations for them.
We aim to improve NLP, and machine translation and multilingual language technologies in particular, through the
use of linguistically motivated levels of semantic representation. Representations of meaning (as opposed to more
surface-oriented components of language such as syntax) are often criticized when quick start-up is required (as in the
case of rapid deployment of MT for new languages). However, well-designed meaning representations offer enormous
potential gains in the quality of a system’s output as well as important long term scientific and technological advances
(Mitamura et al. 1991, Hovy 2003, Philpot et al. 2002, Ambite et al. 2002, Dorr and Olsen 1997, Olsen et al. 1998,
Habash and Dorr 2002, Waibel et al. 1997, Magerman 1995, Collins 1997). By providing criteria for evaluating the
reliability and coverage of interlingual representations, by providing an example of an interlingua that meets such
criteria, and thus by enabling the creation of corpora annotated with interlingual representations, this research will
significantly improve the quality of interlingua-based language technologies and significantly reduce their
development time. The resulting interlingual framework would be useful not only for supporting seamless machine-mediated
linguistic communication between people writing or speaking different languages but also for enabling virtually any
language-based information seeking activity (information retrieval, information extraction, question-answering,
information summarization, data mining, evidence extraction, the detection of significant relationships between
situations and events, and so on).
3. Background
This proposal concerns the creation and evaluation of interlingual representations for various multilingual
applications of language technologies such as machine translation (MT), cross-language information retrieval,
information extraction, question-answering, information summarization and evidence extraction. The main focus of
this research, however, is on interlingual representations for MT.
An interlingua is an intermediate language representation that can be used to mediate between source and target
languages in machine translation. The advantages of using an interlingua for translation are well known. First, because
each language has its own independent analyzer mapping it into the interlingua and generator mapping it out of the
interlingua, any number of source and target languages can be connected without having to write explicit rules for each
language pair and each direction. Thus, interlingual systems both save development time and reduce system size,
especially for bi-directional multilingual systems involving more than two languages. Second, an intermediate
language representation can provide a neutral basis of comparison for translation equivalents that differ syntactically.
In spite of these advantages, interlingual machine translation has not been widely used in comparison to
transfer-based machine translation or, more recently, systems based on statistical methods, which have been gaining
popularity in all areas of language technology. One reason for this situation is that there is no commonly accepted
theory of interlingua, and the problem is too big to address from scratch in the life span of a typical research or
development effort. However, we do not claim that a standard theory of interlingual representation would by itself
increase the popularity of interlingual representations. Indeed, a central problem that any standard theory will have to
account for is the fact that different aspects of interlingua are relevant for different applications of MT. But what is
both necessary and feasible in the near term is the development of a methodology for interlingual representations.
Guidelines for evaluating an interlingua would put boundaries on the problem and enable research projects to gain the
benefits of linguistic knowledge instead of being forced to abandon it.
Ideally, interlingual representations would have the following properties:
Inter-coder compatibility: Two researchers, faced with the same piece of text, should be able to annotate it with
compatible interlingual representations. Compatible encodings would not change meaning in a way that would
cause system failure, but are not necessarily identical. For example, a natural language generator taking two
compatible interlingua representations as input might produce the same output or two different but equally
acceptable outputs. Compatibility is especially important in multi-site development efforts where, for example, a
source language analyzer built in Italy might have to produce an interlingua that is compatible with a target
language generator built in Korea.
Granularity and coverage that are appropriate to the application: For any given application, it is not
necessary to represent every aspect of interlingua. An interlingual representation that is too deep will take a long
time to develop, may not meet the criterion of inter-coder compatibility, and may be difficult to produce reliably
with NLP software. Conversely, an interlingual representation that is not detailed enough will lose distinctions
that are necessary for the application. It is always necessary to strike a balance in order to build a running system.
Striking a balance does not mean sacrificing theoretical correctness. It can, for example, involve a detailed theory
that allows underspecification for non-critical details.
To be well specified and most useful, these properties of interlingua require three distinct but related enterprises:
1. The representational formalism; this involves issues such as whether or not the phenomenon should be
represented by a simple slot-filler pair or should instead have scope over a larger unit of representation, and where
the phenomenon typically fits in relation to other representational units.
2. The representation content (structures, terms and symbols); this involves issues such as whether or not the values
representing the phenomenon are discrete; if so, which symbols to use and, if not, how the continuum is
represented; how the values are determined; and what the relationship is between the symbols and lexical item
definitions.
3. Examples of representation, tied to actual text (which is, of course, most useful with examples in various texts, in
various languages).
Developing such representations and supporting knowledge bases is not trivial, especially because there is often no
obviously correct answer. To ensure the success of such an enterprise, we rely on two strengths unique to the team of
participants in this proposal: (1) Agreement on a clear methodology for arriving at decisions regarding the
interlingua, as indicated by the recent annotation experiment conducted for the workshop of SIG-IL (Special Interest
Group on Interlinguas) (Habash 2002). (2) Complementary research foci and synergy among the participants, as
exemplified by a solid history of successful cross-site collaborations on the Pangloss MT project (LTI, CRL, ISI)
(Farwell et al 1994), the Nitrogen natural language generator (ISI and UMIACS), the Mikrokosmos and Omega
ontologies (ISI and CRL), the 2002 Johns Hopkins Summer Workshop on Generation for Machine Translation
(UMIACS and Columbia), and the three workshops of the SIG-IL that have been held since 1998.
4. Proposed Program of Activities
The central activity of this research effort is to carry out the development of a commonly shared, empirically
motivated interlingual representation system based on a large comparative study of multiple translations (at least
three) of 100 non-English documents in each of six different source languages into the same target language (English).
The ultimate goal is to formulate guidelines on reliability and coverage of interlingua representations and thus to
delimit the task of interlingua design.
The following tasks are described in this section.
1. Collection and pre-processing of corpora
2. Delimiting the phenomena for annotation and designing the corpus markup language
3. Automated detection of dependency mismatches
4. Comparison of human translations
5. Coding by human annotators
6. Evaluation of the annotated corpus
7. Project evaluation
4.1. Collection and Pre-Processing of Corpora
As a preliminary step, prior to the comparative analysis and annotation tasks, the corpora of parallel texts will be
gathered. Each corpus will consist of a number of texts (100 to 125 per language, totaling 50,000 to 100,000 words
per language) from a given source language along with three independently prepared translations into English. This
amount has been chosen because much of the corpus material for Spanish, French, and Japanese is already in place,
having been compiled as part of the 1994 DARPA MT evaluation (White & O’Connell 1994). Each of these corpora
consists of 100 news articles along with two translations into English, prepared independently by different
translators. These will therefore require
only one additional translation of each article. Chinese and Arabic corpora with multiple reference translations have
been created for the DARPA TIDES program, and are available from LDC. For any remaining languages, such as
Persian and Russian, the news articles will have to be selected and each text will then have to be translated into English
by three different translators.
Next, to assist in the analysis and annotation tasks, various text processing tools will be borrowed and modified (or, if
necessary, developed) for NLP tasks including tokenization; recognition of named entities, temporal expressions,
monetary expressions, and other phrases requiring some kind of normalization; morphological analysis and
part-of-speech tagging; phrase boundary recognition; clause boundary recognition; sentence alignment, and word
alignment. Such tools exist already at the participating sites for the languages we have chosen. These will be applied
to automatically segment the data into clauses and phrases (considered the central units of translation) which will then
be aligned (source language unit followed by corresponding target language unit in each of the three translations).
After sentence level alignment of the source language and three human translations, we will have quadruples of
sentences such as the following:
Acumulación de víveres por anuncios sísmicos en Chile
Hoarding caused by earthquake predictions in Chile
Stockpiling of provisions because of predicted earthquakes in Chile
Signs of earthquakes cause stockpiling of provisions in Chile
After tokenization, part-of-speech tagging, and clause level chunking, the following bracketed structures might result:
[ [ [ Acumulación n] [ [ de p] [ [ víveres n] np] pp] np] [ [ por p] [ [ anuncios n] [ sísmicos adj] np] pp] [ [ en p] [
Chile pnp] pp] s]
[ [ [ Hoarding n] np] [ caused v] [ [ by p] [ [ earthquake n] [ predictions n] np] pp] [ [ in p] [ Chile pnp] pp] s]
[ [ [ Stockpiling n] [ [ of p] [ [ provisions n] np] pp] np] [ [ because of p] [ [ predicted psp] [ earthquakes n] np] pp]
[ [ in p] [ Chile pnp] pp] s]
[ [ [ Signs n] [ [ of p] [ [ earthquakes n] np] pp] np] [ cause v] [ [ Stockpiling n] [ [ of p] [ [ provisions n] np] pp] np]
[ [ in p] [ Chile pnp] pp] s]
4.2. Delimiting the Phenomena for Annotation and Designing the Corpus Markup Language
In parallel to corpus collection and pre-processing, the participating sites will come to an agreement on the interlingua
subsystems for which the corpora will be annotated (e.g., thematic roles, temporal relations between events or states,
reference types and coreference relations, rhetorical relations, modality, time and aspect, etc.) and on the procedure for
marking up the corpus.
We will choose approximately four phenomena for markup. Two initial candidates are events (identification of events
independently of whether or not they are expressed as verbs) and thematic roles. The participating sites have
conducted a pilot annotation experiment on thematic role markup on a monolingual corpus, which was the topic of a
workshop at the conference of the Association for Machine Translation in the Americas (AMTA) in October 2002. A
description of the annotation experiment can be found at:
http://www.umiacs.umd.edu/~habash/il-wkshp/il-wkshp.html.
In our experiment, at least one representative from each of our sites (eight annotators in all) was asked to assign
thematic roles to each node in twenty syntactic dependency parse trees from the Penn Treebank (averaging 25 words
in length). With one week of preparation, we agreed on a set of thematic roles (e.g., AGT (agent), THM (theme),
INS (instrument)) and achieved a cross-site inter-annotator agreement rate of 81% (Habash, 2002).
We can base our further work on the consensus that was reached at the workshop concerning thematic role definitions
and markup notation. The proposed work will, however, go further than the pilot experiment in considering bilingual
corpora. The examination of bilingual data will enable us to address research hypotheses concerning cross-linguistic
mismatches in thematic roles.
4.3. Automated Detection of Dependency Mismatches
Machine translation divergences are pairs of source- and target-language sentences that have the same meaning but
different syntax or dependency relations (Dorr, 1994). For example, the meaning of recent past expressed by the
English adverb just, as in I just did my homework, can be expressed in French by a main verb plus particle and
infinitive, venir de v-inf (come from v-inf). With a parallel, aligned corpus, the differences between source and target
translation units, both in terms of lexical forms and word or constituent order, can be easily identified and classified.
We propose to use automatic divergence annotation techniques to tag each source language/translation pair with a
divergence type. These will ultimately be made available to the community for the purpose of cross-linguistic research
and system development, e.g., for DARPA’s translingual effort in TIDES and follow-ons. Our approach will involve
the application of DUSTer (Dorr et al. 2002), the University of Maryland's automatic annotation system, to each source
language-translation pair in our corpora; these will subsequently be reviewed by hand for accuracy.
The DUSTer automated divergence annotation system can identify divergences such as a noun-modifier swap
between the Spanish [[anuncios n] [sísmicos adj] np] and English [[predicted psp] [earthquakes n] np]. Such cases
would be annotated with one of 35 pre-defined divergence types associated with head swapping (divergences in
which the concept corresponding to the syntactic head in one language does not correspond to the syntactic head in the
other language). The resultant annotation is:
...[<DIV:6.FVar2B> [anuncios n] [sísmicos adj] </DIV:6.FVar2B> np]...
…[<DIV:6.FVar2B> [predicted psp] [earthquakes n] </DIV:6.FVar2B> np]…
where the DUSTer divergence rule associated with the annotation 6.FVar2B has a left-hand side that matches the
Spanish structure, and a right-hand side that matches the English structure:
6.FVar2B: [[W1 n] [W2 mod]] > [[W1 mod] [W2 n]]
Annotation of these structures in this way allows us to infer new classes of words associated with certain divergence
types, and it provides a means for improving the performance of alignment for statistical processes later on.
4.4. Comparison of Human Translations
After the corpora are prepared, we will examine differences between the translations of the same corpus. Differences
in human translations may give us clues to which aspects of interlingua are important for translation. Variations
between human translators can fall into three categories:
1. Translator errors
2. Meaningful alternatives due to differences in the translators’ beliefs about what is being said, how it is being
said or why it is being said
3. Non-meaning-bearing alternatives (free variants)
For the text segment Acumulación de víveres por anuncios sísmicos en Chile, there are a number of lexical and
syntactic variations in the three human translations. More importantly, of these variations, none are due to translator
error; some are free variants (because of, caused by, and cause expressing the causation relation); and some may
indicate differences in the translators’ beliefs, such as the portrayal of accumulation as antisocial hoarding or prudent
stockpiling.
Since sentences that are equivalent in meaning for a given application can (but need not) have identical
interlingua representations, we are interested in the non-meaning-bearing translation alternatives. These will allow us
to formulate hypotheses about syntactic and lexical differences that could be neutralized in the interlingua.
We will also pay close attention to meaningful alternatives in human translations as they may point to ambiguities or
vagueness in the source text. These will allow us to formulate hypotheses about appropriate granularity of meaning
representation in the interlingua. We will also be able to formulate hypotheses about which elements of meaning are
inherent in the source text and which are open to interpretation by the reader, thus contributing to our goal of
delimiting the seemingly open-ended task of interlingua design.
4.5. Coding by Human Annotators
The next step is to annotate all three texts with respect to some aspect of interlingua, initially event and object
representation. Suppose that in this case the task is to identify the events referred to or implied along with their
associated thematic structure. The annotators would posit three central events (“amassing of provisions,” “predicting
of earthquakes,” and “an earthquake”) and one state-of-affairs (the “amassing of provisions” is causally related to the
“predicting of earthquakes”). In addition, the annotator would indicate that an amassing event has an agent, implicit at
this point, and a theme, the provisions; that a predicting event has an agent, implicit at this point, and a theme, the
earthquake event; and that the future earthquake has a location, broadly speaking Chile. The amassing is the caused
event, and the predicting is the causing event.
To assist the annotation process both from the point of view of efficiency and from the point of view of consistency, an
annotator’s interface will be developed, modified or extended to support the incipient mark-up activity. The interface
will be an early priority, with regular testing and improvements as requested by the participants.
4.6. Evaluation of the Annotated Corpus and of the Annotation Scheme
We will perform several types of evaluation throughout the project.
The annotated corpora will be evaluated for the accuracy of the coding and inter-annotator agreement, using the
usual measures such as kappa (Carletta 1996). Even for the simple example above, it would not be surprising to find
variations. For instance, while one annotator might view predictions as a cause of the amassing, another might view an
earthquake, albeit only a possible earthquake, as the cause. In any case, such differences will come to light during the
evaluation phase.
Sometimes, such differences may require changes to the notation or to the interlingua symbol(s) representing the
phenomenon in question. Having at hand examples of legitimately different interpretations, and corresponding
suggestions for representing them, will facilitate the development of a robust and powerful interlingua. The fact that
this work will be carried out not at one location, and not by one similarly trained team, is one of the novel aspects of
the proposed work. Few if any other interlingua-construction projects have had this distributed nature.
Evaluation of inter-coder reliability implies that at least some parts of the corpora must be annotated by two or more
annotators. Since the corpora will involve several source languages, and not all of the annotators will know all of the
languages, we will construct a composite English corpus for the purpose of checking inter-coder compatibility. The
composite corpus will have texts from each of the parallel corpora.
In summary, the underlying assumption of the proposed research effort is that a comparative analysis of multiple
translations of a text into the same language provides the soundest empirical basis for formulating a shared interlingua
representation system and for annotating corpora for interlingua content.
We propose to develop an additional metric for evaluation of interlinguas based on growth charts. A growth chart is
a graph of interlingua growth as a function of how much data has been annotated. We have found growth charts to be
diagnostic of strong and weak points of interlingua design and also to be an estimator of the complexity of one domain
in comparison to another (Levin et al., 2002).
The interlingua has a formal definition including the syntax of the interlingua, concept names, slot names, and slot
values. After the formal definition has been established, the annotators work through the parallel corpora, possibly
finding it necessary to add to or change the interlingua definition. The number of additions and changes to the
interlingua definition will be plotted as a function of the amount of data that has been annotated. (Selecting sentences
in random order and using cross-validation are useful in case some parts of the corpus are more complex than others.)
Growth charts can be used to track coverage of the interlingua and to detect problems in granularity of meanings in the
interlingua. If the plot has a steep slope with no sign of leveling off, it is clear that interlingua development is not
complete, possibly due to a level of granularity that is too fine. If this is the case, then the development cycle should be
reinitiated, with a coarser degree of granularity, until the curve levels off in subsequent coverage evaluations. In order
to facilitate the reduction of granularity we will design our interlingua to be semantically rich, but compatible with
underspecification.
We will perform an extrinsic evaluation of our interlingual annotations by using them as input to natural language
generators, and evaluating the output of the generator as if it were the output of a machine translation system. We will
produce target language output from our interlingua using five generation systems: GHMT (Habash and Dorr, 2002);
Halogen (Langkilde 2000); the KANT generator (Mitamura et al., 1991); FUF-SURGE (Elhadad and Robin 1992);
and FERGUS (Bangalore et al 2001). On the face of it, the results of evaluating the output of these systems will tell us
about the quality of both the input and the generators. However, by using a large number of generators, we can
distinguish between effects due to the generator and effects due to the input representation: if all or most generators
show an improvement from one input to another for the same target sentence, then we can conclude that the effect is
due to the input representation and not to a generator coincidentally having trouble with the original input representation.
To evaluate the output of the generators, we will follow a two-pronged approach, paralleling the goals of the recent
LREC-2002 MT evaluation framework presented at the workshop entitled “Human Evaluators Meet Automated
Metrics” (http://www.issco.unige.ch/projects/isle/mteval-may02/mteval-lrec2002.pdf). The two types of evaluation
are: (1) automatic evaluation techniques; and (2) quality judgments by humans. The utility of automatic measures is
clear: they provide cheap, quick, repeatable, and objective evaluation. However, since human judges are the final
reference in MT evaluation, the results of automated metrics must correlate well with (some aspect of) human-based
evaluation.
An automated approach to extrinsic evaluation of our framework will be undertaken using the Bleu technique
developed at IBM (Papineni et al., 2001; Papineni, 2002; Papineni et al., 2002), among others. The metric was adapted
for the recent NIST MT evaluation (Doddington, 2002). The principle of this metric, which is fully implemented,
is to compute a distance between the candidate translation and a corpus of human “reference” translations of the
source text. The distance is computed by averaging n-gram similarity between texts, for n = 1, 2, 3, 4 (higher values do
not seem relevant). That is, if the bi-grams (pairs of consecutive words) and tri-grams of the candidate text are
close to one or more of those in the reference translations, then the candidate scores high on the BLEU metric.
Comparison of the results of this technique with human judgments on the same texts indicates that there is a
correlation between human scores and Bleu scores (Papineni et al., 2001; 2002). Other automated evaluation
techniques include MITRE’s NEE (Named Entity Evaluator), which compares MUC-style named entities in candidate
and reference translations.
The compilation of three sets of English references for 125 texts in each language provides an adequate basis for such
an evaluation. Moreover, our plan for broad distribution of these multiple references will provide an ideal testbed for
other NLP researchers who use Bleu and other automatic scoring methods—the reference translations are immediately
reusable any time changes are made to an existing MT system (or an interlingual representation underlying such a
system). This adds to the significance and usefulness of the corpora.
We will also use the judgment-based measure of clarity (Vanni & Miller 2002), which merges the standard MT
metrics of comprehensibility, readability, style, and clarity into a single evaluation feature. The primary question
asked of the human judge is whether the sentence is immediately clear—akin to the question “Do you get it?”
Since the feature of interest is clarity and not fidelity, it is sufficient that the sentence express some clear meaning;
that meaning need not reflect the meaning of the input text. Thus, no reference to the source text or a reference
translation is required. This is an important benefit of the approach, in contrast to the automated Bleu
technique, where literally thousands of human reference translations are required for acceptable confidence levels
(Papineni, personal communication). Note that the sentence need neither make sense in the context of the rest of the
text, nor be grammatically well-formed; thus, the clarity score for a sentence is basically a snap judgment of the degree
to which some discernible meaning is conveyed by that sentence.
Another crucial advantage of this technique is that it has been shown to correlate, surprisingly, with the metric of
“fidelity.” Thus, the results of applying this metric mimic the results of judging closeness in meaning to the original
source-language text, without requiring bilingual expertise on the part of the human judge.
5. Relation to PIs’ long-term goals and other work in progress
Each of the participating sites (NMSU, UMD, MITRE, CMU, ISI and Columbia) has extensive experience in
interlingual approaches to MT as well as in the use of interlinguas for other language technologies. However, each site
has focused on different aspects of representing the meaning of texts. The following paragraphs describe the past and
present projects of each research site in relation to the proposed research.
New Mexico State University. For the research team at NMSU’s Computing Research Laboratory (CRL), this
research project represents the first stage of a three-stage research program into pragmatics-based MT. The larger,
longer-term effort additionally includes assembling the computational infrastructure for developing pragmatics-based
NLP (and specifically MT) systems, developing a methodology for evaluating such systems, implementing one or
more prototype, limited-domain, pragmatics-based MT systems, and evaluating the performance of these prototype
systems. This work grows out of a series of collaborations over the last ten years which have been aimed at developing
the broad outline of a pragmatics-based approach to MT and a methodology for developing and testing
pragmatics-based NLP systems. For the CRL group, the annotated multilingual corpus represents an important
empirical basis for developing and testing a pragmatic inferencing mechanism, and a standardized interlingua allows
any future results that may come out of our efforts to be used by other research groups.
The proposed research program is in a symbiotic relationship with a number of other current and recent research
projects at the CRL. The lab’s general approach to the full range of NLP applications has stressed knowledge-based
approaches which exploit a common interlingua (Text Meaning Representation or TMR) for the purpose of
representing and reasoning about the information communicated through text. These efforts include Mikrokosmos
(1994-1998), a knowledge-based interlingual approach to Machine Translation which uses TMR as the pivot between
analysis and generation. Keizai (1997-2000) is a cross-language information retrieval system which accepts queries in
various languages and seeks relevant documents in a multilingual database of texts. The key strategy is to convert
the query terms to TMR concepts. The user then selects among the concepts and the results are used to generate key
words in the different languages that serve as a basis for retrieval. The CRL’s approach to question-answering
(2000-present) also relies on converting both query and text to TMR. Initially a knowledge base is constructed by
converting information conveyed by relevant texts in different languages into the interlingua. The query is then
converted into interlingua and that structure is used to extract a responsive interlingual structure from the knowledge
base which, in turn, is used for formulating the answer in the language of the query.
Not only do these efforts stand to be extended and improved by the proposed research but, if the resultant interlingua is
sufficiently similar to TMR, the different systems described above might be more readily accessible to others in the
NLP community. More importantly, both the TMR and the experience gained in developing and implementing
interlinguas as a result of the above research efforts should be very beneficial to the proposed interlingua development
effort.
The MITRE Corporation continues to have efforts in machine translation, with a focus on low-density languages.
The Quick-MT project examined dictionary extraction for building MT lexicons and also for grammar learning
through exemplars (Miller & Zajic 1998; Zajic & Miller 1998). Additionally, the resulting systems were incorporated
into the MITRE prototype CyberTrans, which has since been transitioned to an operational system and continues to
serve as a model for integrating multiple disparate translation engines (Miller et al. 2001). Currently, MITRE is
working in collaboration with the University of Maryland on Transforms. Transforms combines optical character
recognition (OCR) techniques in a platform with MT, and allows component-level and system-level evaluation as well
as investigation of the impact of component-level improvements on system-level performance. Additionally, the
MITRE-sponsored research program, Foreign Language Tool Improvement Through Evaluation (FLITE), is looking at
evaluation methodologies for MT, combining these with learning processes, and improving the natural language
generation aspect of MT. Finally, MITRE has been a driving force in the ISLE-NSF machine translation evaluation
effort (Hovy & Reeder 2001; Vanni & Miller 2001).
In addition to work specifically in machine translation, we have integrated multiple foreign language processing tools,
such as named-entity taggers (Aberdeen et al., 1996; Aberdeen et al., 1995; Vilain, 1999; Vilain & Day, 1996). The
DARPA-funded TIDES work showed large-scale integration of research systems and an exploration of their
interdependencies in the MiTAP system (Damianos et al., 2002a; Damianos et al., 2002b). A related area of ongoing
research is that of temporal and geographic name normalization (Mani & Wilson, 2000; Ferro, 2001; Ferro et al.
2001), in which the team participated in the definition of a tagging standard along with the requisite tools to process
the data. Another integration of MT is the Translating Instant Messenger (TrIM) prototype (Miller et al, 2001;
Condon & Miller, 2002a; Condon & Miller, 2002b). Finally, MITRE is active in the area of summarization (Mani &
Bloedorn, 1999).
University of Maryland, College Park. The interlingual team at the University of Maryland has produced
annotations as a part of their Divergence Unraveling for Statistical Translation (DUSTer) effort (Dorr et al 2002), in a
large DARPA/ONR-funded Multi-University Research Initiative. DUSTer researchers are focused on enabling more
accurate language-to-language alignment and projection of English dependency trees to a foreign language. These
annotations are intended to resolve some of the most prevalent linguistic divergence cases by specifying what would
be required to transform the sentence structure of one language to bear a closer resemblance to that of the other
language. This effort is a descendant of earlier NSF-funded work in which a paradigm based on Lexical Conceptual
Structure (LCS) was developed for representing predicate-argument structures and their associated conceptual units.
The University of Maryland is currently developing automatic divergence annotation techniques based on the
following principles:
• every language pair has translation divergences that are easy to recognize;
• knowing what they are and how to accommodate them provides the basis for refined word-level alignment;
• refined word-level alignment results in improved projection of structural information from English to the foreign
language.
A divergence occurs when the underlying concepts or gist of a sentence is distributed over different words in different
languages. For example, the notion of running into the room is expressed as “run into the room” in English and
“move-in the room running” (entrar el cuarto corriendo) in Spanish. While seemingly transparent for human readers,
this poses problems for statistical aligners. Finding a way to deal effectively with these divergences and repair them
would be a massive advance for bilingual alignment and projection of dependency trees, e.g., for training of
foreign-language parser/translation systems.
Columbia University. Columbia has a long record of research in natural language generation (NLG) and related
areas, such as multimedia information presentation and summarization. NLG usually starts from a non-linguistic level
of meaning representation, and part of the task of research in NLG is to bridge the gap between domain meaning
represented nonlinguistically, and constructs of the target language. The language-independence of the input
representation is particularly clear in multimedia generation, where the same initial representation is used by linguistic
and graphical components (McKeown et al. 1998). Thus, researchers in NLG naturally deal with issues related to
interlingua (though the term is not normally used in NLG).
Recently, there has been interest in corpus-based methods, which are difficult in NLG because of the lack of corpora
annotated with the kind of representations from which NLG usually starts. Relevant work includes extracting semantic,
lexical and translingual information from unannotated corpora (Hatzivassiloglou and McKeown 1997, Fung and McKeown
1997, Barzilay and McKeown 2001) and training generators on annotated corpora using a variety of machine learning
approaches (for example Bangalore et al 2001, Duboue and McKeown 2001, Kan and McKeown 2002, Walker et al
2001). Clearly, such work could be much extended if interlingua-annotated corpora were available, but the current
research efforts do not support the creation of the necessary resources. Columbia personnel also have experience in
directing annotation projects.
Carnegie Mellon University. CMU's Language Technologies Institute has
pursued two types of interlingua design for the KANTOO and JANUS systems.
The KANTOO project (Nyberg and Mitamura, 2000) focuses on high quality translation of technical texts using an
interlingua that is based on predicate-argument structures. The KANTOO interlingua representation is designed for
multi-lingual generation, and has been applied to Spanish, French, German, Italian and Portuguese.
The Janus speech-to-speech translation systems have given us experience in three areas related to the proposed
research: interlingua design, evaluation of interlinguas, and creation of a tagged interlingua database.
The Janus research efforts (Enthusiast (Lavie et al., 1997; Gates et al., 1997; Qu et al., 1997), C-STAR (Levin et al.,
2000), and NESPOLE (Lavie et al., 2002)) have resulted in an interlingua based on speaker intention rather than
literal meaning, designed for spoken language translation systems. Spoken task-oriented language contains many formulaic
expressions that are not translated literally. This has led us to a view of translation divergences based on their function
or meaning. We have found that divergences occur with speech acts such as greeting and requesting, and with modal
and aspectual meanings such as obligation, certainty, evidentiality, disposition, iteration, and habituality. In analysis
and generation, we therefore take a construction-based approach (Fillmore and Kay, 1993) to these types of meanings,
and our interlingua represents these concepts in a way that is independent of their syntactic expression (as main verbs,
auxiliary verbs, affixes, adverbials, etc.) in the source and target languages. In the course of this proposed research we
would like to continue to identify the types of meanings that are associated with translation divergences, and also
study the types of syntactic constructions (formulaic or compositional) that are associated with those meanings.
Because the C-STAR and NESPOLE projects are collaborative international projects (C-STAR has seven partners,
and NESPOLE has four), considerable effort has gone into designing an interlingua that is expressive enough to
provide accurate translations, flexible enough to port to new semantic domains or scale up to larger domains,
but at the same time simple enough to be used reliably by a diverse set of system developers who may never meet
each other. We have therefore developed evaluation metrics for expressiveness, scalability, portability, and cross-site
reliability (Levin et al., 2000, 2002). These metrics can be used as a starting point for the research proposed here,
although they will have to be refined and reformulated to apply generally to any interlingua.
The C-STAR and NESPOLE databases contain dialogues in the semantic domains of travel planning and medical
emergencies (chest pains and digestive problems). Dialogues were recorded and transcribed in English, German,
Italian, Japanese, and Korean. Some of the non-English dialogues have been translated into English, and some
remain monolingual in the database. Each utterance is broken into interlingua segments that roughly correspond to
sentences, and each segment is tagged with an interlingua representation. There are around 10,000 tagged segments
(sentences), which are used for interlingua and grammar development as well as for training of statistical methods
(Langley 2002). Intercoder agreement experiments are described in (Levin et al. 2002).
University of Southern California. The Natural Language Group at the Information Sciences Institute of the
University of Southern California has performed research in MT and multilingual text processing for over a decade.
Either in collaboration with others or on their own, ISI researchers have built interlingual systems such as Pangloss
(Farwell et al 1994; Spanish to English; with CMU and NMSU) and Gazelle (1996; Japanese to English), statistically
trained systems such as Rewrite (Al-Onaizan et al 2000; ongoing; Arabic, Tetun, and later Chinese to English), and
shallow systems such as QuTE (Lin and Hovy 1999; Bahasa Indonesia to English). At ISI our long-term plan is to find
the optimal mixture(s) of statistical and symbolic/manual methods of creating the resources and transformation rules
required for MT. It has been a long-standing goal to apply some of the statistical learning techniques that provide wide
coverage but often somewhat lower quality or restricted performance (only short sentences, or inadequate
pronominalization or proper name rendering, etc.) to an interlingually annotated translation pair, so that one can start
overcoming the quality/limitation bottlenecks while maintaining the robustness so hard to achieve in purely manual
approaches. Should the proposed annotated corpus be created, therefore, we will eagerly apply our latest MT learning
techniques to it.
Current work on MT at ISI focuses primarily on the development of statistical learning techniques that support a variety
of specific MT subtasks: proper name transliteration and translation, phrase unit recognition and translation, etc. A
small additional project (Hovy et al 2003), currently drawing to a close, focuses on the creation of a website that
organizes the complexity of MT evaluation measures (historically a rich, complex, and bewildering field of its
own) into taxonomies that allow potential MT system evaluators to decide more easily what they should measure for
their particular circumstances and how they should measure it (see http://www.isi.edu/natural-language/mteval/).
Slightly broader than MT, but relating directly to interlingua, the semi-automated ontology construction research at
ISI is developing a large new ontology called Omega and a suite of tools for aligning terms into ontologies, extracting
terms from text, discovering cross-relationships between terms, and mining ontological information from websites,
dictionaries, and other text.
6. Relation to present state of knowledge in the field and work in progress elsewhere
There are three central areas of activity to which the proposed research effort relates:
• Interlingua development: Text Meaning Representation (TMR), Interchange Format (IF), Lexical Conceptual
Structure (LCS), Sentence Planning Language (SPL) and AMR, Universal Networking Language (UNL)
• Data annotation and semantic networks: Penn Treebank, PropBank, FrameBank, Levin classes, Omega, Ontos.
• Other MT approaches: direct (SYSTRAN), transfer (Metal), example-based (Japan, UMIST, CMU), stochastic
(IBM, ISI, Germany); FAHQMT (and crummy MT), HAMT (CMU-controlled languages, CRL-Mikrokosmos),
MAHT (RALI, Trados).
Several centers for natural language processing and language technology development have or have had Machine
Translation projects which have followed an interlingual approach or involved the development of interlinguas. These
include Text Meaning Representation at the CRL and the University of Maryland Baltimore County, Lexical Conceptual
Structure at the University of Maryland, Interchange Format at the LTI at CMU, Penman’s Sentence Planning Language (SPL) and its
derivative AMR at ISI, and the Universal Networking Language (UNL) (http://www.unl.ias.unu.edu/) at several
centers around the world. The proposed research effort includes four of these groups and, therefore, is expected to
have a major impact on these efforts. It is unlikely that the resultant interlingua will have all of the features of all of
these interlinguas, but it will clearly be informed by all of these efforts, and it is expected to have a good deal of overlap
with existing interlinguas. The primary contribution of the proposed effort to existing interlinguas is, on the one hand,
to serve as a vehicle for unifying or standardizing them and, on the other, to provide an evaluation methodology and
corpus for testing the coverage and accuracy of interlingual systems.
In regard to the various other data collection and annotation efforts, including those related to the construction of
ontologies or semantic nets, the proposed research program should, in general, fold in with them effectively. Virtually
every semantically-oriented data collection and annotation effort focuses on some aspect of interlingua, whether that
is conceptual structure (or word meaning or ontology construction), state or activity classification (or verb
subcategorization or semantic classes), thematic roles (or verb case frames or valency), or propositional structure
(predicate-argument structure).
Finally, the proposed research is clearly related to other on-going efforts in MT, even ones that do not involve
interlinguas. An interlingual approach to MT design and development has traditionally been juxtaposed to the other
basic strategies for achieving fully-automated high quality MT: direct approaches as exemplified by most currently
available operational systems such as SYSTRAN in its early versions, transfer approaches as exemplified by the
remaining operational systems such as Metal and SYSTRAN in its more advanced versions, and example-based
approaches. With respect to all these efforts, the proposed research will be of interest and use to those systems (such as
direct or transfer-based systems) which can and do make use of semantic information where possible. As for the other
major classification of MT systems, i.e., the distinction between fully-automatic MT, human-assisted MT (HAMT,
e.g., CMU-controlled languages, CRL-Pangloss) or machine assisted human translation (MAHT, e.g., RALI, Trados),
the proposed research mainly offers an evaluation corpus of parallel text against which to test systems with each new
version.
Just as research in the various representational and lexical phenomena will inform our work, our work should inform
these other efforts. The central difference between assigning, say, propositional content to a text in one language and
propositional content within an interlingua markup is that, while the former needs to account for presuppositions,
entailments and default inferences, the latter must also account for translation equivalence relations and
translation divergences. That is to say, interlingua markup needs to account for multilingual relationships as
well as monolingual relationships.
7. Management Plan
The research efforts will include corpus development, tool development, comparative analysis of translations,
interlingua specification, corpus annotation, development of an evaluation methodology and evaluation. The proposed
length of the project is 36 months. All of the efforts will be directed by the project PIs. The Gantt chart (following
page) describes the overall project plan and schedule.
Corpus development. For Spanish, French and Japanese, a third translation will be produced for each of the existing
125 texts in each source language. Chinese and Arabic corpora with multiple English translations will be obtained
from LDC. For all other languages, a source language corpus of 125 news articles will be compiled and then translated
into English by three independent translators (5 weeks effort).
Toolkits. This task will involve preparation or development of tools for each language, including a tokenizer, sentence
boundary detector, named-entity recognizer, part-of-speech tagger, phrase and clause recognizers, alignment tools,
and interface tools (6 months effort).
Preparing the corpora. This involves applying the various tools for automatically marking up, segmenting and
aligning the texts followed by any hand correction needed for all the texts and translations in each corpus (1 month
effort).
Workshops. The research partners will hold four 2- to 4-day workshops, with open participation, at which
a proposal for the interlingua content for annotation will be presented, discussed, modified and adopted by the project
participants. The workshops will also focus on the annotation methodology as well as the evaluation methodology (1
month preparation for each workshop).
Three cycles of annotation and evaluation of corpora. First, a comparative analysis of each source language text and its
three translations will be carried out to identify and categorize each translation variation (3 months effort for each
language). The annotation effort involves developing a common interface for annotating multilingual parallel corpora
for interlingua content (6 months’ effort). In each cycle, the annotated corpora will be evaluated, and results will be
compiled and reported (1 month effort). The toolkits and annotation interface will be revised and updated as necessary
(1-2 months’ effort in each cycle). The cycle of annotation and evaluation will be repeated three times. This cyclical
planning-annotation-evaluation process will ensure the development of a resource that is as consistent as possible.
Furthermore, the openness of the planning and evaluation phases ensures that the resultant corpora will be usable by
the largest number of groups possible. A final 2-day workshop will be held before or during month 35, which would
focus on a final critical review of the evaluation methodology, the interlingua and the annotated corpus.
Documentation. Each corpus, tag set, toolkit, comparative analysis, interlingua subsystem, annotated text, evaluation
methodology and evaluation result will be documented and disseminated in a written publication or report. A final
project report will also be prepared.
Plan for documentation and sharing of research products
All data, both raw data and annotated data, will be placed in the public domain and made accessible via the internet to
any interested organization. In addition, the data will be made available through the Linguistic Data Consortium.
The tools used for this research project will also be made available to the research community and other interested
organizations via the project website from which they can be downloaded.
Reports, both quantitative and qualitative, on the results of the comparative study for each source-target language
corpus and for all corpora combined will be prepared and presented at major scientific meetings (ACL, COLING)
and workshops (AMTA IL workshop, Stanford AI Spring series, ARDA Northeast Regional Workshop). In addition,
all preliminary results will be published as part of the CRL’s Memoranda in Computer and Cognitive Science series.