Machine Translation Through Clausal Syntax:
A Statistical Approach for Chinese to English
by
Dan Lowe Wheeler
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degrees of
Master of Engineering in Computer Science and Engineering
and
Bachelor of Science in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2008
© 2008 Massachusetts Institute of Technology. All rights reserved.
The author hereby grants to M.I.T. permission to reproduce and distribute
publicly paper and electronic copies of this thesis document in whole or in
part, and to grant others the right to do so.
Author:
Department of Electrical Engineering and Computer Science
May 23, 2008

Certified by:
Michael Collins
Associate Professor of Computer Science
Thesis Supervisor
Accepted by:
Arthur C. Smith
Professor of Electrical Engineering
Chairman, Department Committee on Graduate Theses
Machine Translation Through Clausal Syntax:
A Statistical Approach for Chinese to English
by
Dan Lowe Wheeler
Submitted to the Department of Electrical Engineering and Computer Science
on May 23, 2008, in partial fulfillment of the
requirements for the degrees of
Master of Engineering in Computer Science and Engineering
and
Bachelor of Science in Computer Science and Engineering
Abstract
Language pairs such as Chinese and English with largely differing word order have proved
to be one of the greatest challenges in statistical machine translation. One reason is that
such techniques usually work with sentences as flat strings of words, rather than explicitly
attempting to parse any sort of hierarchical structural representation. Because even simple
syntactic differences between languages can quickly lead to a universe of idiosyncratic surface-level word reordering rules, many believe the near future of machine translation will lie
heavily in syntactic modeling. The time to start may be now: advances in statistical parsing
over the last decade have already started opening the door.
Following the work of Cowan et al., I present a statistical tree-to-tree translation system
for Chinese to English that formulates the translation step as a prediction of English clause
structure from Chinese clause structure. Chinese sentences are segmented and parsed, split
into clauses, and independently translated into English clauses using a discriminative feature-based model. Clausal arguments, such as subject and object, are translated separately using
an off-the-shelf phrase-based translator. By explicitly modeling syntax at a clausal level, but
using a phrase-based (flat-sentence) method on local, reduced expressions, such as clausal
arguments, I aim to address the current weakness in long-distance word reordering while still
leveraging the excellent local translations that today's state of the art has to offer.
Thesis Supervisor: Michael Collins
Title: Associate Professor of Computer Science
Acknowledgements
Many thanks to Michael Collins, who introduced me to Natural Language Processing both
in the classroom and in the lab, and advised my thesis. I'd like to thank Brooke Cowan and
Chao Wang for their continual guidance, especially in the beginning. Thanks to Mom and
Dad. And a huge thanks to Hibiscus for hearing me out on those Saturdays I spent in lab.
Contents
1 Introduction
2 Previous Research in Machine Translation
  2.1 A Quick History of Machine Translation
  2.2 Advances in Statistical Machine Translation
      2.2.1 Direct Transfer Statistical Models
            2.2.1.1 Word-based Translation
            2.2.1.2 Phrase-based Translation
      2.2.2 Syntax-Based Statistical Models
3 Conceptual Overview: Tree-to-Tree Translation Using Aligned Extended Projections
  3.1 Background: Aligned Extended Projections (AEPs)
  3.2 Machine Translation: Chinese-to-English-specific Challenges
4 System Implementation
  4.1 End-to-End Overview of the System
      4.1.1 Preprocessing
      4.1.2 Training Example Extraction
            4.1.2.1 Clause splitting
            4.1.2.2 Clause alignment
            4.1.2.3 AEP extraction
      4.1.3 Training
            4.1.3.1 Formalism: A history-based discriminative model
            4.1.3.2 Feature Design
      4.1.4 Translation
            4.1.4.1 Gluing Clauses into a Sentence
  4.2 Chinese Clause Splitting
      4.2.1 Step 1: Main Verb Identification
      4.2.2 Step 2: Clause Marker Propagation Control
      4.2.3 Design Rationale
  4.3 Clause Alignment
      4.3.1 Method Used and Variations to Consider
      4.3.2 Word Alignments: Guessing Conservatively
  4.4 AEP Extraction
      4.4.1 Identifying the Pieces of an English Clause
            4.4.1.1 Main and Modal Verb Identification
            4.4.1.2 Spine Extraction and Subject/Object Detection
            4.4.1.3 Adjunct Identification
      4.4.2 Connecting Chinese and English Clause Structure
5 Experiments
  5.1 AEP Prediction Accuracy
  5.2 System Output Walk-Through
  5.3 Further Work
      5.3.1 The Need for a Modifier Insertion Strategy
      5.3.2 The Need for a Clause Gluing Strategy
      5.3.3 Chinese Segmentation and Parsing
6 Contributions
Chapter 1
Introduction
State-of-the-art statistical machine translation performs poorly on languages with widely
differing word order. While any translation scheme will struggle harder the less its two
languages have in common, one crucial reason for this performance drop-off is the current
near-complete lack of syntactic analysis: viewing input sentences as flat strings of words,
current systems tend to do an excellent job translating content correctly, but for a hard
language pair, particularly over longer sentences, the grammatical relations between content
words and phrases rarely come out right. Two Chinese-to-English translations produced by
the Moses phrase-based translator of Koehn et al. [2] serve to illustrate:
REFERENCE 1: The Chinese market has huge potential and is an important part
in Shell's global strategy.
TRANSLATION 1: The Shell, China's huge market potential, is an important part
of its global strategy.
REFERENCE 2: In response to the western nations' threat of imposing sanctions
against Zimbabwe, Motlanthe told the press: "who should rule Zimbabwe is Zimbabwians' own business and should not be decided by the western nations."
TRANSLATION 2: In western countries threatened to impose sanctions against the threat of Zimbabwe, Mozambique, Randt told the media: "Zimbabwe should exactly who is leading the Zimbabwean people their own affairs, and not by western countries' decision."
Notice how the content phrases of TRANSLATION 1, such as huge potential, important part, and global strategy, are for the most part dead on. The relations between them are severely mangled, however: Shell now appears to be equated to China's huge market potential, and it is unclear what its refers to. Aside from translating Motlanthe as Mozambique, Randt,¹ the content in TRANSLATION 2 is similarly quite good. Incorrect grammatical relations, however, similarly ruin the sentence: the opening preposition seems to suggest that the event took place in western countries instead of Zimbabwe, and the quote itself is far from grammatically competent.
One reason for such low grammaticality is that it is difficult to learn rules for reordering a sentence's words and phrases without any knowledge of its high-level structure. The following is an example of how, when viewed as flat strings, word order between Chinese and English can quickly become complicated:

English Word Order: Professor Ezra Vogel is one of America's most famous experts in Chinese affairs
Chinese Word Order: Vogel Ezra Professor is America most famous DE China affair expert ZHI one
Even simple syntactic differences between languages can, on the surface,
lead to incomprehensible differences in word order. One example is the head-first/head-last
discrepancy
between Japanese and English: Japanese parse trees are similar to those of
English, except
with each node's child order recursively flipped. This discrepancy alone (and
there are of
course many more) leads to an ocean of idiosyncratic word reordering rules,
and yet structurally speaking it can be described quite simply. A similar discrepancy
between Chinese
and English is illustrated in the example sentence pairs below:
¹This particular mistranslation of Motlanthe likely stems from a mistake made during Chinese word segmentation, by accidentally splitting Motlanthe into multiple words.
English Word Order: [[the person] [WHO bought the house]]
Chinese Word Order: [[bought house DE ] [ person ]]
(A modifier clause goes after the thing it modifies in English, but before it in Chinese.)

Word order quickly diverges as modifier clauses are added:

English Word Order: [[the person] [WHO bought [[the house] [THAT everybody wanted]]]]
Chinese Word Order: [[ bought [[ everybody wanted DE ] house ] DE ] [ person ]]
In English, a relative clause such as who bought the house follows after the noun phrase it
modifies: the person who bought the house. In Chinese, on the other hand, a clause comes
before the noun phrase, forming bought house DE person (where DE is the Chinese equivalent
of who in this context.) When the object of the clausal modifier is itself a complex noun
phrase with its own modifier clauses, the word orderings between languages quickly become
divergent. For a translation system to be able to model this difference between Chinese and
English, clearly some form of representation is needed that can identify the (often nested)
relative clauses and the noun phrases that they modify.
To summarize, machine translation must go beyond flat-sentence-based representation if
it is to succeed in translating languages with significant syntactic differences. The good news
is that steady advances in parsing over the last decade have made syntactically motivated
translation more viable. This thesis applies the discriminative tree-to-tree translation framework of Cowan et al. [1] to the Chinese-to-English language pair, with the goal of improving
performance over current methods through explicitly modeling the clausal syntax of both
languages.
Chapter 2 reviews previous research in machine translation and orients the reader within
its current landscape. Chapter 3 introduces and motivates the core concept of clausal translation using Aligned Extended Projections, and discusses inherent Chinese-to-English-specific
challenges. Chapter 4, the meat of the thesis, explains how the system actually works. Chapter 5 probes system performance through experiments and discusses essential further work.
The final chapter summarizes my contributions.
Chapter 2
Previous Research in Machine Translation
Machine Translation - the automatic translation of one natural language into another through
the use of a computer - is a particularly difficult problem in Artificial Intelligence that has
captivated researchers for nearly 6 decades. Its difficulty and holy grail-like end goal have
only made it a more fascinating and addictive research area. At the same time, however,
the fact that learned humans can do translation effectively is a proof by example that "Fully
automatic high quality translation" is an achievable dream. Moreover, despite its flaws, MT
has already proved itself over the years as a useful technology, and one that has initiated
several successful commercial ventures including SYSTRAN and Google Translate.
This chapter aims to briefly outline the history of MT, to orient the reader within the
current state of the art, and to contrast the work of this thesis against other contemporary
approaches. My goal is not to give a comprehensive summary of the current literature;
rather, I have selected a few bodies of work that illustrate the field as a whole.
2.1 A Quick History of Machine Translation
Initial enthusiasm for machine translation began shortly after the great success of World
War II code breaking efforts. Warren Weaver, the director of the Natural Sciences Division
of the Rockefeller Foundation, released the first detailed proposals for machine translation
in his March 1949 memorandum [15].
These proposals were heavily based on information
theory and cryptography, modeling a foreign language as a coded version of English. His
proposals were a stimulus for research centers around the globe.
The Georgetown-IBM experiment, performed on January 7th, 1954, was the first widely
publicized demonstration of machine translation. The experiment involved the fully automated translation of 60 Russian sentences into English. Specializing in the domain of organic
chemistry, the system employed a mere 6 grammatical rules and a vocabulary of only 250
words. It was nevertheless widely perceived as a great success - the authors went so far as
to claim that machine translation would be a solved problem in a couple years [6], believing
that the complexity of natural language wasn't significantly greater than the codes that had
been fully deciphered during WWII.
Within a few years, researchers began to understand just how complex the problem
of machine translation indeed is. Their early systems mostly used painfully hand-enumerated
bilingual dictionaries and grammatical rules for modeling word order differences between languages, a technique that was quickly recognized as fragile and restrictive. New developments
in linguistics at the time, largely Chomsky's transformational grammar and the budding field
of generative linguistics, served as a theoretical foundation for more expressive translation
systems.
Fueled heavily by the Cold War, research in machine translation continued full steam
throughout the early 1960s. It is worth mentioning that imperfect translation systems proved
to be valuable tools even at this early stage. As an example, both the USSR and the United
States used MT technology to rapidly translate scientific and technical documents, to get a
rough gist of their contents and determine whether they were of sufficient security interest -
whether they should therefore be passed to a human translator for further study. The poor
quality of the translations, in this context, was made up for by the stamina with which a
computer mainframe could perform.
In 1966, after a decade of high enthusiasm, heavy governmental expenditure, and results
that were far less than satisfying, research in machine translation, in the United States at
least, almost entirely came to a halt. The Automatic Language Processing Advisory Committee (ALPAC), a body of seven scientists commissioned by the US government, published
an influential and pessimistic report that triggered a major reduction in funding for fully
automated machine translation. It recommended instead that more focus be placed on foundational problems in computational linguistics, as well as tools for aiding human translators
such as automatically generated dictionaries [20]. While such foundational research did give
better theoretical background for modern MT, 1966-1978 was a difficult period for the field.
It is important to note that part of the aftermath of this loss of funding was the establishment of several commercial ventures by former academics. Several successful MT systems
resulted that are still in use today, including SYSTRAN¹ and Logos.
A source of frustration in building MT systems at the time lay in the endless amount
of effort highly trained linguists and computer scientists had to spend explicitly modeling
a pair of languages and the differences between them. Most of the systems of the time
can safely be called rule-based for this reason, in that they consisted of a large body of
hard-coded rules for how to analyze, represent, and translate one language into another.
SYSTRAN and Logos fit squarely into this paradigm. MT research steadily accelerated in the 1980s, budding into several new approaches by the end of the decade. Statistical machine translation is one such area that has the great advantage of not requiring an enumeration of rules and other linguistic knowledge. Rather, SMT approaches start with a large corpus of
rules and other linguistic knowledge. Rather, SMT approaches start with a large corpus of
text, either plain or linguistically annotated, and learn how to translate on their own through
statistical models and machine learning algorithms. Interest in SMT grew as the availability of inexpensive computational power, in step with the advent of personal computers, began to soar.

¹The core translation technology used in Babel Fish and formerly Google Translate.
While relatively nascent, statistical machine translation is now the dominant subfield
in MT research. Google Translate is an excellent commercial SMT success story, and was
consistently ranked at or near the top of the last NIST Machine Translation Evaluations in
2006. The next section reviews SMT up to the present.
2.2 Advances in Statistical Machine Translation
Machine translation models, statistical or otherwise, are traditionally pictured in terms of
the machine translation pyramid diagrammed in figure 2.1. Machine translation is conceptualized into three stages: analysis, transfer and generation. Analysis interprets an input
text to produce the representation that the translation system reasons about. For example, a
system might perform a syntactic analysis that produces a parse tree from a foreign sentence.
On top of that, it might go further to analyze the semantics of the sentence, forming a deeper
representation such as first-order logic or lambda calculus. The final stage, generation, works
in reverse of analysis, extracting an English text from the transferred representation. As an
MT system's internal representation - a parse tree, for example - often consists of strictly
more information than the input text that it is constructed from, this generation step is
typically easier than the analysis. It is for deeper analysis, where the internal representation
moves increasingly further away from the original foreign input, that generation becomes a
substantial challenge.
Typically, as more analysis is performed, the transfer stage becomes less and less prominent. Hence the pyramid. At the top of the pyramid, where foreign texts are converted to
a language-independent interlingua representation, no transfer step is needed at all. Similarly, systems at the base of the pyramid perform no analysis and instead comprise one large
transfer step, reasoning with input texts as flat strings of words.
Figure 2.1: Machine Translation Pyramid. Machine Translation is viewed in three stages: analysis, transfer
and generation. Analysis first interprets a foreign input text to form the representation that the translation
system reasons over. Transfer then converts this representation into an English representation. As more
analysis is performed, transfer becomes less prominent. At the top of the pyramid, where texts are converted
to a language-independent interlingua representation, no transfer stage is needed at all. Systems at the base
of the pyramid conversely consist of one large transfer step, performing no explicit syntactic or semantic
analysis. The final stage, generation, extracts an English sentence from the transferred representation.
Statistical machine translation is still crawling at the bottom of the pyramid. Direct
transfer architectures to date, excepting a few syntax-based competitors, have significantly
outperformed all other approaches that involve some sort of syntactic or semantic analysis.
Climbing up one rung of the MT triangle, several new syntax-based translation models are
beginning to show potential. The next sections review direct transfer and syntax-based
translation models, and contrast the approach of this thesis against other works.
2.2.1 Direct Transfer Statistical Models
Direct transfer models work with foreign text nearly the way that it is input: as a flat string
of word tokens. Their representation is therefore minimal. Word-based and phrase-based
direct transfer approaches are reviewed in this section.
2.2.1.1 Word-based Translation
In the early 1990s, IBM developed a series of statistical models that, despite a near complete
lack of linguistic theory, pushed forward the state of the art and inspired future systems.
While the models, now named IBM Models 1 through 5, did not always outperform well-established rule-based systems such as SYSTRAN, part of the attractiveness of the IBM
models was the relative ease with which they could be trained. Contrasted with the years
upon years that linguists and computer scientists spent encoding the translation rules behind SYSTRAN and other systems, the designers of the IBM models could simply hand a
computer a corpus of parallel sentence translations and let it crunch out the numbers quietly
on its own, simply by counting things - for example, the co-occurrences of English words
with foreign words.
The IBM models, and several SMT systems to follow, all learn parameters for estimating p(e|f), the probability of an English sentence e given a foreign sentence f. Applying Bayes' rule, this is equivalent to maximizing p(e) · p(f|e), where p(e) is a language model - a probability distribution that allocates high probability mass to fluent English sentences, independent of the corresponding foreign input - and p(f|e) is an inverted translation model. The IBM models estimate p(f|e). A range of techniques for building language models, such as smoothed n-gram models, are combined with an IBM model to form p(e|f). Translation then becomes a decoding problem in this view:

    ê = argmax_e p(e|f) = argmax_e p(e) · p(f|e)

Translation decodes a foreign sentence into an English sentence, by finding the English sentence ê that maximizes p(e|f). The IBM models are commonly called word-based systems because they make predictions for how each word moves and translates, independent of other words.
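To make the decoding equation concrete, here is a minimal sketch of noisy-channel decoding (illustrative only: the candidate set and both log-probability functions are assumed inputs, whereas a real decoder searches a vast hypothesis space incrementally rather than enumerating candidates):

    def noisy_channel_decode(f, candidates, lm_logprob, tm_logprob):
        # argmax_e p(e) * p(f|e), computed in log space so the product
        # becomes a sum of a language-model and a translation-model score.
        return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))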
IBM models 1 through 5 break p(f|e) into an increasingly complex product of several
distributions, with each distribution measuring a different aspect of the variation between
one language and another. The following diagram, adapted from [24], is a canonical example
that demonstrates a minor simplification of IBM Model 4 translating a sample English
sentence into Spanish. The model works in four steps: fertility prediction, null insertion,
word-by-word translation, and lastly, distortion.
Mary did not slap the green witch
        |  Fertility, e.g. n(3|slap)
Mary not slap slap slap the green witch
        |  Null Insertion, p-null
Mary not slap slap slap NULL the green witch
        |  Translation, e.g. t(verde|green)
Maria no daba una bofetada a la verde bruja
        |  Distortion
Maria no daba una bofetada a la bruja verde
Fertility allows one English word to become many foreign words, to be deleted, or to
remain as a single foreign word. In this case, "did" is deleted (fertility 0), and "slap" grew
into 3 words (fertility 3). Null insertion allows words to be inserted into the foreign sentence
that weren't originally present in the English. Later down the road, the NULL word in this example becomes the Spanish function word "a". Next, each English word, including
NULLs, is independently translated into a Spanish word. Finally, distortion parameters
allow these words to be shuffled around, to model the differences in word order between
Spanish and English. Probability distributions for each of these steps are learned during a
training phase, by counting various events in a parallel corpus. An Expectation Maximization algorithm [23] is typically employed to do the counting. During translation, the decoder searches for the English sentence ê that maximizes the combined product of these four groups of learned distributions.
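The four steps can be replayed in a few lines of code. The sketch below hard-codes the choices the model would make for this example; every table is an illustrative stand-in for a learned distribution (n, t, and the distortion model), not a real parameter:

    fertility = {'Mary': 1, 'did': 0, 'not': 1, 'slap': 3,
                 'the': 1, 'green': 1, 'witch': 1}
    translations = {'Mary': ['Maria'], 'not': ['no'],
                    'slap': ['daba', 'una', 'bofetada'],  # one word per copy
                    'NULL': ['a'], 'the': ['la'],
                    'green': ['verde'], 'witch': ['bruja']}

    english = ['Mary', 'did', 'not', 'slap', 'the', 'green', 'witch']

    # Step 1: fertility deletes "did" (fertility 0) and triples "slap".
    step1 = [w for w in english for _ in range(fertility[w])]
    # Step 2: null insertion (the position is a modeled choice; fixed here).
    step2 = step1[:5] + ['NULL'] + step1[5:]
    # Step 3: word-by-word translation, consuming one translation per copy.
    used = {w: 0 for w in translations}
    step3 = []
    for w in step2:
        step3.append(translations[w][used[w]])
        used[w] += 1
    # Step 4: distortion reorders "verde bruja" into "bruja verde".
    step4 = step3[:7] + [step3[8], step3[7]]
    print(' '.join(step4))  # Maria no daba una bofetada a la bruja verde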
2.2.1.2 Phrase-based Translation
A weakness of word-based systems that early phrase-based systems - the next generation
of direct transfer SMT - improved greatly upon is the loss of local context that is inherent
in translating each word independently. The IBM models can't leverage the fact that very
commonly between one language and another, the translation of a given word will depend
heavily on the words nearby. Consider an idiom for example: it is much more natural to
translate an idiom like kick the bucket as a whole, than it is to consider all the combinations
of word-by-word translations and pick the most probable. Similarly, collocations such as
with regards to are often best translated as a whole. In translating from English to Chinese,
for example, it is very common to translate with regards to as a single Chinese word.
Phrase-based systems improve upon word-based systems in exactly this area by modeling many-to-many word translations - building a phrase table, rather than a word-to-word dictionary, that counts co-occurrences of foreign and English phrases among other events. Phrase-based systems learn a probability distribution p(f|e) similar to the IBM models, and can be seen as consisting of three steps: dividing the sentence into contiguous phrases, reordering the sentence at the phrasal level, and lastly, translating each phrase independently to form the output foreign sentence. Combined with a language model p(e), the model can then be used to estimate p(e|f) via Bayes' rule.
Input Sentence:      She told me with regards to the new agenda to proceed on Thursday
Identify Phrases:    |She told me| |with regards to| |the new agenda| |to proceed on Thursday|
Reorder Phrases:     |the new agenda| |with regards to| |She told me| |to proceed on Thursday|
Translate Each Phrase Independently, yielding the Output Sentence:
                     the new agenda with regards to She told me to proceed on Thursday
                     (italics indicate foreign translations)
Note that the phrases that a phrase-based system learns to recognize are only of statistical significance; the phrases are not necessarily linguistically motivated or otherwise
theoretically grounded. Similar to the IBM models, translation consists of a decoding step
where the optimal phrase boundaries, reorderings, and phrase translations are searched for
over all possibilities. Due to the combinatorial explosion of possibilities in the search, designing algorithms for how to build phrase tables during training, and how to search over
them during online translation, is no easy challenge. A large body of literature exists on
these subjects, including [7, 2]. Moses is a popular open source phrase-based decoder used
in this thesis.
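The three steps can be sketched as follows. The segmentation, reordering, and phrase translator are supplied by hand here, whereas a decoder like Moses searches over all of them under phrase-table and language-model scores; all names below are illustrative, not Moses's API:

    def phrase_based_translate(sentence, boundaries, reordering, translate_phrase):
        # Step 1: divide the sentence into contiguous phrases.
        spans = zip([0] + boundaries, boundaries + [len(sentence)])
        phrases = [sentence[i:j] for i, j in spans]
        # Step 2: reorder the sentence at the phrasal level.
        phrases = [phrases[i] for i in reordering]
        # Step 3: translate each phrase independently.
        return [w for p in phrases for w in translate_phrase(p)]

    sent = 'She told me with regards to the new agenda to proceed on Thursday'.split()
    out = phrase_based_translate(
        sent,
        boundaries=[3, 6, 9],      # |She told me|with regards to|the new agenda|...
        reordering=[2, 1, 0, 3],   # the phrasal reordering shown in the diagram above
        translate_phrase=lambda p: ['_'.join(p).upper()])  # stand-in phrase-table lookup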
Like word-based systems, a great advantage of phrase-based systems is the relative simplicity of the models and corresponding ease with which they can be constructed. To train
a phrase-based decoder like Moses, the MT practitioner merely needs to hand the software
a parallel corpus of translations and leave it cranking for several hours or days. Simplicity
and ease of training aside, phrase-based systems have now become a mainstream technique
in machine translation, contending well with SYSTRAN and other culminations of several
decades of research in rule-based systems. Google Translate, for example, perhaps the leader
in commercial machine translation, has quietly switched from a SYSTRAN backend to an
in-house phrase-based system. By October 2007 all supported language pairs were switched
to Google's own software [13].
Phrase-based systems are nevertheless quite far from the accuracy and fluency of human
translations. Perhaps the most accessible shortcoming of the models is their inability to learn
syntactic differences between languages: they have no abstract sense of nouns, verbs, clauses,
prepositional phrases, and so on. Rather, the knowledge that a phrase-based system (and
other direct-transfer systems) gleans from a corpus is highly repetitive - highly lexicalized - in that it must learn the same patterns for distinct but similar word combinations. Consider
the following similar Spanish-to-English translations, taken from Cowan's PhD dissertation:
SPANISH:  la casa roja          SPANISH:  la manzana verde
GLOSS:    the house red         GLOSS:    the apple green
ENGLISH:  the red house         ENGLISH:  the green apple
Both cases involve a Determiner+Noun+Adjective -> Determiner+Adjective+Noun translation reordering, and yet a phrase-based system would need to learn separate rules for each
word group. One can imagine the combinatorial explosion of lexicalized rules, all of them
related to this simple syntactic difference between Spanish and English. A phrase-based
system therefore cannot effectively generalize this pattern to use it on a new, previously
unencountered Determiner+Noun+Adjective combination.
The Spanish-English example illustrates one small difference between noun phrases of
each language. Differences at higher syntactic levels, ones that result in substantial long-distance word reorderings, also cannot be modeled at all using phrase-based systems. The head-first/head-last discrepancy between English and Japanese, and the difference in the attachment of relative clauses in English and Chinese, are both excellent examples of syntactic phenomena that phrase-based systems cannot effectively capture. Wh-movement, the phenomenon in English responsible for the movement of question words like which and what to the beginning of a sentence, is another classic example.
Researchers in the last few years have made progress augmenting phrase-based systems
to incorporate some syntactic knowledge. Liang et al. [21] made improvements by moving part-of-speech information into a phrase-based decoder's distortion model. Koehn and Hoang [22] have generalized phrase-based models into factor-based models that behave much the
same way but can incorporate additional information like morphology, phrase bracketing and
lemmas.
2.2.2 Syntax-Based Statistical Models
Syntax-based statistical machine translation is growing increasingly prominent in the field.
This section describes two rough categories in which such systems tend to differ - choice of
representation and the languages modeled - and orients my system within those categories.
I and many others incorporate context-free phrase structure into our syntactic representations. Several other representations, however, are currently being explored. In the last few years, tree-to-tree transducers [15] have surpassed phrase-based systems in Chinese-to-English translation, as judged through both human evaluation and statistical metrics.
Chiang's [16] hierarchical phrase-based representation, at the other end of the spectrum,
effectively generalizes broad syntactic knowledge from an unannotated corpus, from within
a much looser formalism.
The languages that are syntactically modeled - source, target or both - form another
dimension over which syntax-based SMT systems differ. My approach and other tree-to-tree
translation systems, such as Nesson and Shieber's approach [25], model both the syntax of
source and target languages. Xia and McCord [18] focus on source-side syntax, automatically
learning how to restructure the child order of the nodes of a source parse tree to make it
look more like a target parse tree. Yamada and Knight [17], on the other hand, present a
system that probabilistically predicts a target parse tree from an input source string.
So far I've described my system in broad terms. More specifically, my system focuses
on modelling clausal syntax, the syntax of verbs, through Aligned Extended Projections.
Chapter 3 introduces and motivates this model in detail.
Chapter 3
Conceptual Overview: Tree-to-Tree Translation Using Aligned Extended Projections
Brooke Cowan, Ivona Kučerová and Michael Collins designed and implemented a tree-to-tree translation framework for German to English in 2006 [1], with similar performance to phrase-based systems [2].
This thesis centers around applying their approach to the
Chinese/English language pair, and exploring new directions for their relatively new model.
This chapter introduces and motivates that model at a conceptual level. Section 3.2 explains
why "porting" Cowan's approach to Chinese-to-English is exciting and substantial, rather
than merely a matter of implementation.
At the highest level, translation proceeds sentence-by-sentence by splitting a Chinese
parse tree into clauses, predicting an English syntactic structure called an Aligned Extended
Projection (AEP) for each Chinese clause independently, and, finally, linking the AEPs to
obtain the English sentence. Verbal arguments, including subjects, objects, and adjunctive
modifiers (such as prepositional phrases), are translated separately, currently using a phrase-based system [2]. The hope is to achieve better grammaticality through syntactic modelling
at a high level, while still maintaining high-quality content translation by applying direct transfer methods to reduced expressions. Figure 3.1 shows a simplified view of the split and prediction stages on a Chinese-to-English example.

Figure 3.1: Translation scheme at a high level. A Chinese sentence is first parsed and split into separate clauses. An English clause structure, called an Aligned Extended Projection (AEP), is then predicted for each Chinese clause subtree independently. The final step (not shown) is to link the AEPs to obtain the final English sentence. Note that this is a simplification of the AEP representation; figure 3.2 illustrates the AEP prediction step in detail.
3.1 Background: Aligned Extended Projections (AEPs)
Aligned Extended Projections build on the concept of Extended Projections in Lexicalized
Tree Adjoining Grammar (LTAG) as described in [3], through the addition of alignment
information based on work in synchronous LTAG [4]. Roughly speaking, an extended projection applies to a content word in a parse tree (such as a noun or verb), and consists of a syntactic tree fragment projected around the content word, which encapsulates the syntax
that the word carries along with it. This projection includes the word's associated function
words, such as complementizers, determiners, and prepositions. The figure below shows the
EPs of three content words - said, wait, and verdict - for the example The judge said that
we should wait for a verdict. Notice how the EPs of the two verbs, said and wait, include
argument slots for attaching subject and object. Importantly, an extended projection around
a main verb encapsulates that verb's argument structure: how the subject and object attach
to the clausal subtree, and which function words are associated with the verb.
[Figure: the extended projections of said (with the complementizer that), wait (with the modal should and the preposition for), and verdict (with the determiner a), for the example sentence above.]
An Aligned Extended Projection (AEP) is an EP in the target language (English) together
with alignment information that relates that EP to the corresponding parse tree in the source
language (Chinese). For the remainder of this thesis I will speak of clausal verb AEPs
only: in this case the extra alignment information includes where the subject and object
should attach, and where adjunctive modifiers should be inserted.
Anatomy of an Aligned Extended Projection
A good way to understand an AEP is to investigate the way my system predicts one from
a Chinese clause. The AEP prediction step - the core of the translation system - is diagrammed thoroughly in figure 3.2. (The linguistic terminology used in this overview is also reviewed toward the right of the figure.)

[Figure 3.2 shows three prediction steps - Step 1: predict the English clause backbone; Step 2: predict the English argument modifiers; Step 3: graft or delete the remaining Chinese modifiers - together with a terminology sidebar:
Argument: an essential clausal argument, such as subject or object, that a given clause needs to be grammatically complete: The boy ate a sandwich.
Adjunct: something, like a prepositional phrase, that adds to a clause and yet, if removed, leaves the clause grammatical: The boy from New York ate a sandwich at noon.
Modifier: arguments and adjuncts.
Extended Projection (EP): the syntactic fragment a content word projects around itself. A verb's EP reveals the structure of the clause: where the subject and object attach, and which function words (such as that and to) are used to relate the arguments.
Aligned Extended Projection (AEP): an extended projection with extra information that aligns it to a corresponding foreign structure. This thesis models the English AEPs of clausal verbs, so the alignment information consists of which Chinese modifiers are inserted into the English subject and object positions, and in which positions the remaining modifiers are "grafted," if anywhere. Note that in this representation, any Chinese modifier may be inserted into English subject or object position. As examples, a Chinese prepositional phrase (an adjunct) can become the English subject, and a Chinese subject can be grafted onto its English counterpart as an adjunct.]

Figure 3.2: AEP prediction broken down into three steps. The input is a Chinese extended projection. The first step is to predict an English extended projection from the Chinese. The next two prediction steps align that English clause to the Chinese clause: step 2 predicts where the English subject and object should be aligned, and step 3 finally predicts the alignment of the remaining Chinese modifiers.

The input is a Chinese clause structure: an
extended projection around a Chinese verb, indicating where modifiers such as subject and
object attach. The first step is to predict the corresponding English verb and its extended
projection. In the figure above, that extended projection indicates that the English clause
starts with the function word that, shows where the subject attaches, and also indicates that
the English clause has no object. The second and third steps predict how the English clause
structure is aligned to the Chinese input. Step 2 looks at the English clause's subject and
object attachment slots if present, and figures out which Chinese modifiers should be aligned
to them. Step 3 determines the alignment of any remaining Chinese modifiers.
A more technical specification of the AEP representation that I use in this thesis is given
in table 3.1. Specifically, an AEP can be thought of as a basic record datatype that consists of
several fields: an English verb, a syntactic spine projected around that verb that indicates its
clausal structure, and several pieces of alignment information - subject, object, and modifier
attachment - that link that English verb's clause structure to a parallel Chinese structure.
An additional AEP field indicates which wh function words (e.g. in which, where, etc.) are
present in the English clause.¹

AEP field    Description
STEM         Clause's main verb, stemmed.
SPINE        Syntactic tree fragment capturing clausal argument structure.
WH           A wh-phrase string, such as in which, if present in the clause.
MODALS       List of modal verbs, such as must and can, if present.
SUBJECT      - The Chinese modifier that becomes the English subject, or
             - NULL if the English clause has no subject, or
             - A fixed English string (e.g. there) if the English subject doesn't
               correspond to a Chinese modifier.
OBJECT       Analogous to SUBJECT.
MOD[1...n]   Alignment positions for the n Chinese modifier phrases that have not
             already been assigned to SUBJECT or OBJECT.

Table 3.1: The AEP representation used in this thesis is a simple record datatype. During the AEP prediction step - the core of the translation model - each field is predicted in sequence.
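Read as code, Table 3.1 describes a simple record type. A minimal sketch follows; the field types are my own simplification, and the thesis does not prescribe an implementation:

    from dataclasses import dataclass
    from typing import List, Optional, Union

    @dataclass
    class AEP:
        # English-side clause structure plus Chinese alignment information.
        stem: str                       # clause's main verb, stemmed
        spine: str                      # tree fragment capturing argument structure
        wh: Optional[str]               # wh-phrase string such as "in which", if any
        modals: List[str]               # modal verbs such as "must" and "can"
        subject: Union[int, str, None]  # Chinese modifier index, fixed string, or None
        object: Union[int, str, None]   # analogous to subject (mirrors OBJECT above)
        mods: List[str]                 # positions for the remaining Chinese modifiers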
My system predicts each of these fields in order: first the main verb, followed by its
English syntactic spine, and so on. The order matters because the prediction of a given
field depends in part on predictions made for previous fields: syntactic spine prediction (the
SPINE field), for example, is highly dependent on the main verb (STEM field) that the spine
is projected around.
Cowan's tree-to-tree framework [1] provides techniques for (1) extracting <source clause,
target AEP> training pairs from a parallel treebank, and (2) training a discriminative
feature-based model that predicts target AEPs from source clauses. My system uses modifications of these techniques, explained in detail in Chapter 4.
¹This representation is a simplification of that used in [1]. For example, Cowan's AEP representation also included fields for the English clause's voice (active or passive) and verb inflection.
3.2 Machine Translation: Chinese-to-English-specific Challenges
The easy side of this project is that Cowan's framework for German-to-English translation,
along with preexisting word segmenters, part-of-speech taggers, and parsers, is ready to be
leveraged. One might think that applying a translation system to a new language pair is a
relatively straightforward task - grungy perhaps but not intellectually exciting. This section
illustrates why that is simply not the case, highlighting several of the unique challenges that
the Chinese-to-English language pair brings forward.
Linguistic modelling challenges: Novel feature set needed for Chinese-to-English
The features developed for Cowan's German-to-English translation framework heavily exploited dependencies specific to the German and English languages. One contribution of this
thesis is the development of a new feature set tailored to the Chinese/English language pair.
To give one example, because the Chinese languages do not have an inflectional morphology,
new features must be developed to predict the inflected forms of translated English verbs.
Written Chinese uses the aspect marker zai, for example, to express the progressive
aspect, rather than the -ing verb suffix used in English. I developed features that are sensitized to this closed class of Chinese aspect words, to try to help predict English inflectional
forms.
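As an illustration of one such feature (a sketch only: the marker list is a hypothetical, romanized fragment of the closed class, and the predicate is not the system's actual feature code):

    # An indicator feature that fires when a Chinese clause contains a
    # progressive-aspect marker, cueing an English -ing verb form. The
    # marker inventory below is an assumed, partial list (romanized).
    PROGRESSIVE_MARKERS = {'zai'}

    def clause_has_progressive_aspect(clause_words):
        return any(w in PROGRESSIVE_MARKERS for w in clause_words)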
Error propagation from word segmentation
In written form, Chinese words contain no spaces between them. While words do indeed
exist in the Chinese languages, word boundaries aren't present on paper. One theory is that
a character-based written system, as opposed to a phonetic system such as written English,
makes it easier for readers to recognize the different words in a sentence, so word separation
becomes unimportant. Either way, the extra processing step of segmenting (aka tokenizing)
Chinese sentences into words creates a new layer of ambiguity and an early source of error that
can easily derail later stages. As an example of one class of segmentation-related problems,
most segmenters, when confused, tend to break up a chunk of characters into many single-character words.
Long transliterated names are particularly vulnerable to segmentation
into several nonsensical nouns and verbs.
The transliteration of "Carlos Gutierrez," for
example, contains seven characters selected for their sound. On their own, however, four of
the characters, as examples, mean "to block," "iron," "thunder," and "ancient." Translation
would fail hilariously if the segmenter were to break this name into these separate single-character words.
A clause-by-clause translation system such as presented in this thesis
would fail particularly badly - each verb (such as "to block") created from the nonsensical
segmentation would likely be interpreted as its own clause, complete with subject and object.
The Chinese/English corpus used in this thesis was processed with a special segmenter
that attempts to lessen the impact of exactly this problem - when the segmenter doesn't
know how to divide up a string of characters, it tends to leave them strung together rather
than splitting them up into single character words.
Fragility with respect to parsing errors
The biggest danger in any translation scheme that incorporates syntactic analysis is that
the analysis can potentially do more harm than good through the mistakes it introduces.
Because Chinese-to-English translation performs rather poorly in methods that forgo syntactic analysis, my hunch in building this system was that the trade-off would be worth it.
At the same time, however, Chinese sentence parses are generally worse than those of German and English: word segmentation errors alone are often enough to result in nonsensical
parses. Additionally, properties of the Chinese language family, including topicalization and
subject/object dropping, also partially explain the relative decrease in parse quality from
a parser [10] designed for English. The challenge is to be especially cautious of erroneous
parses. My system is conservative in this respect, through safety checks made during the
extraction of training data. Essentially, if any safety checks fail, my system considers a given
sentence a misparse and moves on to the next sentence, without extracting any training
data. The details are explained in section 4.3. Training aside, parsing mistakes made during actual
translation nevertheless greatly impact performance.
To give an example of the sorts of parse errors that can cripple my translation system,
consider Chinese names. Because most names are rare proper nouns, it is frequently the
case that a given Chinese name will not have been previously encountered in the training
data. Add the fact that the characters that compose Chinese names frequently have other
meanings, and one can understand why it is relatively easy for the parser to make an error
by not recognizing a name as a proper noun - or even a noun at all. Consider the name Lin Jianguo, for example. The surname, Lin,² means forest. The given name, Jianguo, roughly means to build the country. Given a misparse, my system could
easily dedicate an entire clause to this name, with disastrous results. Segmentation errors
only make these sorts of problems worse.
Lack of inflectional morphology
The Chinese language family contains no verb conjugations, and no tense, case, gender or
number agreement. Tense, for example, is established implicitly, and can only be determined
for a given sentence without other context if it contains a temporal modifier phrase such as
"today" or "last Saturday." One major extension to Cowan's model for Chinese-to-English
that may be worth pursuing in the future would be to do extra bookkeeping to keep track
of tense information from earlier clauses, as a means of filling in missing information in later
clauses. Clause predictions would then no longer be strictly independent of each other. For
now, I have no means of combating this problem.
²Contrary to English, in Chinese languages the surname comes first, followed by the given name.
Subject/Object dropping
Chinese is a pro-drop language, meaning once established in previous sentences, the subject or object (sometimes both) may be dropped from a sentence entirely. This makes sentence-by-sentence translation particularly ill-suited to Chinese. Similar to the implicit tense issue,
I'm currently thinking about ways to use subject and object information from previous
sentences to fill in missing information - it is a hard problem, and one not explored in this
thesis.
Rich idiomatic usage
Chinese formal writing is full of condensed 4-character idioms called Chengyu. These often
contain a subject and verb and yet should be translated into English as a single word. One
idiom, for example, translates directly as draw-snake-add-legs, and yet in most sentences
should be translated as the single word unnecessary or redundant. The Chinese language
contains thousands of such examples. Phrase-based systems tend to do a good job on these
types of condensed idioms, because they can bag up the entire idiom and translate it in
one piece. Syntactic analysis may get in the way in this respect: it is challenging to handle
idioms as effectively as phrase-based systems.
Chapter 4
System Implementation
A working Chinese-to-English clausal translation system is a major contribution of this thesis.
Chapter 4 presents this system and describes its parts in detail. Section 4.1 sketches the full
system at a high elevation, outlining the system's three stages: training data extraction,
training, and translation. Training data extraction - deriving <Chinese clause input, English
AEP output> training examples from a parallel corpus of Chinese and English sentences - is the area that required the greatest amount of research. Sections 4.2-4.4 describe the three
steps that I developed: clause splitting, clause alignment and AEP extraction.
4.1 End-to-End Overview of the System
The system consists of three separate stages: extraction of training examples, training of
the model, and translation. Figure 4.1 gives an overview of the extraction and translation
stages. All sentences are first segmented, tagged, and parsed. These steps are reviewed
under 4.1.1. Training example extraction, reviewed in detail in sections 4.2-4.4, is sketched
briefly in 4.1.2. Given appropriate training data, the underlying AEP prediction model is a
linear history-based model trained with the averaged perceptron algorithm [12] and decoded
using beam search.¹

Figure 4.1: Control flow during training example extraction (left) and translation (right). Note that training happens between these two stages. Extraction consists first of monolingual analysis over each language: word segmentation (Chinese only), part-of-speech tagging, parsing and clause splitting. Chinese/English clause subtrees are next paired together according to word alignments, or thrown away if judged inconsistent. An English AEP data structure is then generated from the clause pair, resulting finally in a training example: a Chinese clause subtree paired with an English AEP. Translation starts with the same preprocessing steps, by converting a Chinese sentence into several clause subtrees. An AEP is then predicted from each clause independently according to the trained model. The final translation step is to link the AEPs together and flatten them, yielding an English sentence as output.

Section 4.1.3 describes the training procedure. Section 4.1.4 describes
how the trained model is used to perform translations.
4.1.1 Preprocessing
Written Chinese, like many East Asian languages, does not delimit the words within a
sentence. It is therefore necessary to segment Chinese sentences into words before other
processing can take place. Chinese word segmentation is a difficult, frequently ambiguous task, and a prerequisite of all other steps.

¹Combining the perceptron algorithm with beam-search is similar to a technique described by Collins and Roark [9].
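To see how segmentation ambiguity arises, consider greedy maximum matching, a classic baseline shown here for illustration only (it is not the segmenter used in this thesis):

    def max_match(chars, dictionary, max_word_len=4):
        # Greedy left-to-right maximum matching: repeatedly take the longest
        # dictionary word starting at the current position. The one-character
        # fallback is exactly how rare transliterated names shatter into the
        # single-character "words" described in section 3.2.
        words, i = [], 0
        while i < len(chars):
            for j in range(min(len(chars), i + max_word_len), i, -1):
                if chars[i:j] in dictionary or j == i + 1:
                    words.append(chars[i:j])
                    i = j
                    break
        return words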
Next, parsing takes a segmented sentence and derives a context free parse tree. I'm
currently using a Chinese-tailored variant of the Collins parser [10] trained on the Penn
Chinese Treebank [14]. Part-of-speech tagging is first performed using a maximum entropy
statistical tagger.
4.1.2 Training Example Extraction
After preprocessing, the training corpus consists of parallel Chinese/English sentence parses.
The next step is to process the corpus to extract the desired <Chinese clause subtree, English
AEP> training example input/output pairs. This is performed deterministically, through
several carefully tuned, hard-coded heuristics. These heuristics are the product of both direct
observation of the data, and the annotation standards of the Penn Chinese and English
treebanks. The aim is not to be correct 100% of the time, but to capture the bulk of the
accurate training data available with minimal rules.
Each training data extraction step is diagrammed in the left portion of figure 4.1. Clause
splitting, clause alignment, and AEP extraction are major contributions of this thesis, and are focused on in detail in subsequent sections.
4.1.2.1 Clause splitting
Because my system produces translation at the clausal level, parse trees first need to be
broken into separate clause subtrees. I use a small set of rules to do this on the Chinese side,
in two steps. Minimal verb phrases are first marked as clauses. These clause markers then
propagate up through their ancestors according to a series of constraints. One constraint,
for example, allows an IP to pass its clause marker to a CP parent, in effect including a
complementizer word within the clause. Clause markers propagate until no more propagations are allowed. I have implemented this technique and found it works well. I use Brooke
Cowan's clause splitter on English parse trees. Clause splitting is reviewed in section 4.2.
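A sketch of the two-step procedure over simple tree nodes follows. The propagation constraints here are a reduced and partly assumed set (the text above names only the IP-to-CP rule; the VP-to-IP rule is my own illustrative addition), and section 4.2 describes the real rules:

    ALLOWED = {('VP', 'IP'),   # assumed: a marked VP may pass its marker to an IP parent
               ('IP', 'CP')}   # from the text: an IP may pass its marker to a CP parent

    class Node:
        def __init__(self, label, children=()):
            self.label, self.children = label, list(children)
            self.is_clause = False

    def mark_minimal_vps(node):
        # Step 1: mark verb phrases that contain no other VP as clauses.
        results = [mark_minimal_vps(c) for c in node.children]  # visit every child
        if node.label == 'VP' and not any(results):
            node.is_clause = True
        return node.label == 'VP' or any(results)

    def propagate_once(node):
        # Step 2: one pass of upward marker propagation under ALLOWED.
        changed = False
        for c in node.children:
            changed = propagate_once(c) or changed
            if c.is_clause and (c.label, node.label) in ALLOWED:
                c.is_clause, node.is_clause = False, True
                changed = True
        return changed

    def split_clauses(root):
        mark_minimal_vps(root)
        while propagate_once(root):   # repeat until no propagation is allowed
            pass
        return root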
4.1.2.2 Clause alignment
This step serves two purposes: (1) align the clauses in a sentence pair and (2) filter out potentially bogus training data. Clause alignments are made based on symmetric word alignments
from the GIZA++ implementation of the IBM models [7], by computing word alignments
for Chinese/English and English/Chinese and taking their intersection. Any sentence pair
that contains an unequal number of clauses is discarded from the start. Furthermore, if one
Chinese clause contains words aligned to multiple English clauses, or vice versa, all involved
clauses are discarded. This filtering is intentionally conservative (high precision, low recall),
to increase the quality of the training examples in the face of occasionally inaccurate parse
trees. Clause alignment is reviewed in detail in section 4.3.
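A sketch of this conservative filter, assuming the intersected word alignments and per-word clause assignments are already computed (all argument names here are mine, not the system's):

    def align_clauses(word_aligns, clause_of_zh, clause_of_en, n_zh, n_en):
        # word_aligns: set of (chinese_word_index, english_word_index) pairs,
        # the intersection of GIZA++ alignments run in both directions.
        if n_zh != n_en:                 # unequal clause counts: discard the pair
            return []
        links = {(clause_of_zh[i], clause_of_en[j]) for i, j in word_aligns}
        zh_targets, en_targets = {}, {}
        for zc, ec in links:
            zh_targets.setdefault(zc, set()).add(ec)
            en_targets.setdefault(ec, set()).add(zc)
        # Keep only one-to-one clause correspondences; any clause aligned to
        # multiple clauses on the other side is discarded with its partners.
        return [(zc, ec) for zc, ec in links
                if len(zh_targets[zc]) == 1 and len(en_targets[ec]) == 1]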
4.1.2.3 AEP extraction
In the training stage, a machine learning algorithm will inspect a corpus of <Chinese clause,
English AEP> example pairs to attempt to learn how to predict an English AEP given a
Chinese clause. This final step of training data extraction refines a raw <Chinese clause,
English clause> translation pair into the format that the learning algorithm requires: a
<Chinese clause, English AEP> input/output example pair. AEP extraction is reviewed in
detail in section 4.4.
4.1.3 Training
This section reviews the underlying representation used to predict English AEP structures
from Chinese clauses: a history-based, discriminative model. Such a model offers two major
advantages. The greatest strength in a feature-based approach is the flexibility with which
qualitatively wide-ranging features can be integrated. As examples, features used in this
thesis include structural information about the source clause and target AEP under consideration, properties of the function words present (such as modal verbs, complementizers
and wh words), and lexical features of the subject and object. Secondly, feature sets can
be developed with relative ease for new language pairs: while features in the original model
exploited German/English dependencies, I built a custom set for Chinese/English.
4.1.3.1 Formalism: A history-based discriminative model
Following work in history-based models, I formalize AEPs in terms of decisions, such that
the prediction model can be trained using the averaged perceptron algorithm [12]. The
formalism presented here closely follows that of Cowan et al. [1], section 5.1.
Assume a previously extracted training set of n examples, (x_i, y_i), for i = 1...n. Each x_i
is a clause subtree, with y_i being the corresponding English AEP. Each AEP y_i is represented
as a sequence of decisions (d_1, ..., d_N), where N, a fixed constant, is the number of fields
in the AEP data structure: d_1 corresponds to the STEM field, d_2 corresponds to SPINE,
and so on, ending with d_N as the last MODIFIER field.

Each d_j, j = 1...N, is a member of D_j, the set of all possible values for that decision. I
create a function ADVANCE(x_i, (d_1, d_2, ..., d_{j-1})) that returns a subset of D_j corresponding to the allowable decisions d_j given x_i and previous history (d_1, d_2, ..., d_{j-1}). A decision sequence (d_1, ..., d_N) is well-formed for a given x if and only if d_j ∈ ADVANCE(x, (d_1, ..., d_{j-1})) for all j = 1...N. The generator function GEN(x) is then the set of all well-formed decision sequences for x.
The model is defined by constructing a function φ(x, (d_1, ..., d_{j-1}), d_j) that maps a decision d_j in context x, (d_1, ..., d_{j-1}) to a feature vector in R^k, where k is the constant number of features in the model. For any decision sequence y = (d_1, ..., d_m), partial or complete, the score is defined as:

    SCORE(x, y) = Φ(x, y) · α

where Φ(x, y) is the sum of the feature vectors corresponding to each decision (d_1, ..., d_m), namely, Φ(x, y) = Σ_{j=1}^{m} φ(x, (d_1, ..., d_{j-1}), d_j).
The parameters of the model, α ∈ R^k, weight the features by their discriminative ability;
α is precisely what is learned during training. The perceptron algorithm is a convenient
choice because it converges quickly, generally only requiring a few iterations over the training
examples [9],[12]. I specifically use the averaged perceptron algorithm as presented in [12]
to learn the weights.

Aspects of a Chinese clause:
1. main verb
2. any verb in the clause
3. all verbs, in sequence
4. clausal spine
5. full clause tree
6. complementizer words
7. label of the clause root
8. each word in the subject
9. each word in the object
10. function words in the spine
11. adverbs
12. parent and siblings of clause root

Aspects of an English AEP:
1. does the spine have a subject?
2. does the spine have an object?
3. does the spine have any wh words?
4. the labels of any complementizer nonterminals
5. the labels of any wh nonterminals in the spine
6. the nonterminal labels SQ or SBARQ
7. the nonterminal label of the root of the spine

Table 4.1: The aspects of a Chinese clause and of an English AEP that are incorporated into
features. Additional features can easily be added in the future.
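To make the training loop concrete, below is a minimal sketch of averaged perceptron training over AEP decision sequences, under the formalism above. All names (train_averaged_perceptron, phi, decode, examples) are hypothetical stand-ins, not the thesis's actual implementation.

    # Minimal sketch of averaged-perceptron training for AEP prediction.
    # Assumptions (hypothetical): `examples` is a list of (x, gold) pairs where
    # gold is the extracted decision sequence; `phi(x, history, d)` returns a
    # sparse feature dict; `decode(x, weights)` returns the highest-scoring
    # well-formed decision sequence (e.g. the beam search of section 4.1.4).
    from collections import defaultdict

    def sum_features(x, seq, phi):
        # Phi(x, y): sum of per-decision feature vectors along the sequence.
        total = defaultdict(float)
        for j, d in enumerate(seq):
            for f, v in phi(x, seq[:j], d).items():
                total[f] += v
        return total

    def train_averaged_perceptron(examples, phi, decode, iterations=5):
        weights = defaultdict(float)   # alpha
        summed = defaultdict(float)    # running sum of weights, for averaging
        steps = 0
        for _ in range(iterations):
            for x, gold in examples:
                steps += 1
                predicted = decode(x, weights)
                if predicted != gold:
                    # Standard perceptron update: promote the gold sequence's
                    # features, demote the incorrect prediction's features.
                    for f, v in sum_features(x, gold, phi).items():
                        weights[f] += v
                    for f, v in sum_features(x, predicted, phi).items():
                        weights[f] -= v
                # Naive O(k) accumulation, acceptable for a sketch.
                for f, w in weights.items():
                    summed[f] += w
        # Averaging the weight vector over all steps reduces overfitting
        # relative to returning the final weights.
        return {f: s / steps for f, s in summed.items()}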
4.1.3.2 Feature Design
Training and translation both use a feature function Φ(x, y) that maps a Chinese clause
x and English AEP y into a vector of features. This section describes that function. A
strength of discriminative models is the ease with which qualitatively wide-ranging features
can be combined; theoretically, a feature could be a function of any aspect of a Chinese
clause x, any aspect of an English AEP y, or any sort of connection between the two. Table
4.1 enumerates the complete set of Chinese clause aspects and English AEP aspects used to
construct features in the current model. The mapping from aspect to feature is for the most
part direct. For example, one aspect of a Chinese clause is its main verb - for every Chinese
main verb encountered in the training data, I created an indicator feature φ_i(x) that is True
if a clause x has that main verb, or False otherwise. The Chinese features are based on the
German features used in [1], while the English features are identical.
It is worth noting that a relatively simple set of features goes a long way. For
example, the function words in a Chinese clause's spine include prepositions, verbal aspect
words (such as the perfective aspect 了 or the continuous aspect 着), locatives, conjunctions,
and more. Chinese conjunctions often connect directly to the conjunctions of an English
translation, while at the same time, Chinese verbal aspect words and locatives often influence
the English main verb, including its tense.
Also note that the feature vector created for each <Chinese clause, English AEP> pair is
enormous: for example, an indicator feature is created for every single word in the corpus that
ever appeared within a subject or object. The weight vector α effectively filters uninformative
features, keeping only those that have predictive power.
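As an illustration only, an indicator-feature function of this flavor might be sketched as follows; the clause/AEP accessors and feature names are hypothetical, not the thesis's actual feature set.

    # Sketch of indicator features over a (Chinese clause, English AEP) pair,
    # in the spirit of Table 4.1. Attribute names on `clause` and `aep` are
    # hypothetical. Each feature is an indicator: present with value 1.0.
    def extract_features(clause, aep):
        feats = {}
        feats["zh_main_verb=" + clause.main_verb] = 1.0
        feats["zh_root_label=" + clause.root_label] = 1.0
        for w in clause.subject_words:
            feats["zh_subject_word=" + w] = 1.0
        # Features may also conjoin a Chinese aspect with an English AEP
        # aspect, connecting the two sides of the translation.
        feats["zh_main_verb=%s,en_has_subject=%s"
              % (clause.main_verb, aep.spine_has_subject)] = 1.0
        return feats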
4.1.4 Translation
The translation phase is outlined in the right half of Figure 4.1, and works as follows: An
input Chinese sentence is first segmented, parsed, and split into clause subtrees. An AEP,
representing high-level English clausal structure, is predicted for each Chinese clause independently using the trained model. Clausal modifiers - subject, object, and adjuncts - are
translated separately, using the Moses phrase-based translator [2]. The rationale is to obtain
improved translation quality through syntactic analysis at a high level, while still utilizing
the typically good content translation phrase-based systems have to offer. Additionally, by
applying simpler methods to reduced expressions, my approach becomes less vulnerable to
parsing errors: high-level verbal structure must be accurate, but the system isn't sensitive
to low-level parsing details such as the inner structure of the subject and object.
During AEP prediction, the model is decoded using beam search. An AEP is built one
decision at a time, in the order (d_1, ..., d_N), and at any given point a beam of the top M
partial decision sequences is maintained. The ADVANCE function is used to increment the beam one step
further. Partial decision sequences are ranked using the SCORE metric.
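A minimal sketch of this decoder, assuming ADVANCE and SCORE are available as Python callables (all names hypothetical):

    # Beam-search decoding of an AEP, one decision at a time.
    # `advance(x, history)` returns the allowable next decisions given the
    # partial sequence; `score(x, seq)` ranks partial sequences; N is the
    # number of AEP fields and M the beam width.
    def beam_decode(x, advance, score, N, M=10):
        beam = [()]  # start from the empty decision sequence
        for _ in range(N):
            candidates = [seq + (d,) for seq in beam for d in advance(x, seq)]
            # keep the M highest-scoring partial sequences
            beam = sorted(candidates, key=lambda s: score(x, s), reverse=True)[:M]
        return beam[0]  # the highest-scoring complete decision sequence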
4.1.4.1 Gluing Clauses into a Sentence
A significant challenge of this approach is: after each clause of a Chinese sentence has been
translated into a full English clause, modifiers and all, how should the clauses be combined
to form the full sentence? This is a substantial problem that was not focused on in
this thesis. Instead, I put in a placeholder implementation, ready to be replaced by a more
sophisticated method.

The implementation glues English clauses together depth-first. That is, for a Chinese
sentence, consider the depth-first order of its clauses - the first clause in the sequence is the
one at the root of the tree. The gluer simply concatenates, in depth-first order, each clause
translation. Such an approach is slightly better for Chinese than the left-to-right gluing used
in [1]. For example, a Chinese clause very commonly has a lower relative clause that attaches
to its object:
[IP [NP subject] [VP [V verb] [NP [CP relative clause] [NP object]]]]
Depth-first order reorders these clauses correctly, by translating the higher clause and
lower clause independently and concatenating them: [subject verb object [relative clause]].
Section 5.3.1 talks more about the cases it can't handle and suggests improvements.
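A sketch of the placeholder gluer described above; the clause representation (a translation string plus sub-clauses in tree order) is hypothetical:

    # Depth-first clause gluing: emit each clause's translation, then recurse
    # into its sub-clauses, so a relative clause hanging off an object is
    # emitted immediately after its parent clause.
    def glue(clause):
        parts = [clause.translation]
        for sub in clause.subclauses:   # sub-clauses in left-to-right order
            parts.append(glue(sub))
        return " ".join(parts)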
4.2 Chinese Clause Splitting
In the training extraction and translation stages of my system, a clause "splitting" step is
performed that takes a parse tree of a sentence and identifies all the subtrees that correspond
to clauses. For example, the green circles in Figure 4.2 indicate the roots of the clause
subtrees for the English sentence The one that organized the meeting did encourage other
members to meet the new speaker, but everyone was busy. Labeling the clauses depth-first,
the first clause is the one did encourage other members, the second is that organized the
meeting, the third is to meet the new speaker, and the fourth is but everyone was busy. The
clauses attach as follows:
[1: the one [2: that organized the meeting] did encourage other members [3: to
meet the new speaker] [4: but everyone was busy]].
I developed an algorithm for splitting Chinese sentences into clauses: given a sentence parse
as input, the clause markers (green circles in the diagram) are returned as output. The
program was designed by looking at hundreds of randomly drawn English and Chinese parse
trees from the corpus, to figure out several heuristic rules that would identify the Chinese
clauses most of the time. While I did no formal evaluation of clause splitting performance
(because I have no evaluation data to begin with), from observation I can say that it was
infrequently the case that my program would assign bad clause markings to a good parse
tree. Note that it is not necessary to give good output for 100% of the data; the aim here
is to be correct the majority of the time, by capturing the common cases with a few simple
heuristics.
A clause splitter for English has already been developed by Cowan et al. [1]. I use it
untouched to split my English parse trees. A contribution of this thesis is a novel technique
for splitting Chinese clauses. While part of the design is Chinese-specific, I believe the
approach taken is an effective way to split clauses of any language, including English.
My system splits clauses through the use of two stages of rules. First, the main verbs
in the parse are identified. Clause markers are assigned to each of these verbs. Second, the
clause markers propagate up the tree, the aim being for a given marker to stop once it has
reached its clause root. Figure 4.2 diagrams each of these stages. The red dashed rectangles
around the main verbs in the sentence are the result of stage 1. The thick red branches in
the parse tree illustrate the journey that these clause markers undergo, before reaching their
final destination at the root of their respective clauses. Stage 2 rules control this propagation and eventual settling.
[Figure: parse tree of the example sentence, with red dashed rectangles around the clausal verbs and thick red branches showing the clause markers' propagation paths.]

Step 1: Identify clausal verbs: organized, encourage, meet, was.
Step 2: Propagate upwards to clause root: S, S, SG, CCP.
Note 1: The verb 'did' functions here as a modal verb, not a clausal verb. Verbs such as these make clausal verb identification (Step 1) non-trivial.
Note 2: A clause marker will never pass through a noun phrase, because noun phrases never constitute clauses. Propagation therefore ends. Several other rules of this type come into play to control propagation.
Note 3: For each of the two branches indicated, propagation is prohibited: a child cannot pass upwards to a parent that has already had a clause marker pass through it.

Figure 4.2: Overview of the clause splitting program. While the program was designed for and tailored
to Chinese, an English example was chosen, for clarity, to illustrate. A first set of rules identifies the main
(clausal) verbs in the sentence, as indicated by the red dashed rectangles. Clause markers are assigned to
each of these verbs, which then propagate up the tree, eventually stopping at the clause root. Notes 1-3
describe a few of the interesting parts of this program.
The notes in the figure point to a few of the interesting parts of the
program, described in detail in the remaining sections.
4.2.1 Step 1: Main Verb Identification
This section reviews the first step towards clause splitting: clausal verb identification. The
goal is to identify each of the clausal verbs in a sentence. Each clausal verb corresponds
to its own clause subtree - stage 2 identifies that subtree. In figure 4.2, the clausal verbs
are organized, encourage, meet, and was. Stage 1 is non-trivial because of the existence of
non-clausal verbs, such as modal verbs in English and Chinese (did in figure 4.2), Chinese
stative verbs, Chinese verbs that behave like adverbs, and more.
Figure 4.2 actually shows a slight simplification of how stage 1 works. The actual program does something equivalent and slightly more complicated: it identifies the clausal verb
phrases (VP nodes in the parse tree) that are above the clausal verbs. The reason, in short,
is that the rules are easier to express that way. For example, consider the English modal
verb did and the clausal verb encourage. The words themselves, inside the parse tree, look
the same - they have the same part-of-speech tag VV. The VP nodes directly above them,
however, look quite different.
[VP [VV did] [VP ...]]        [VP [VV encourage] [NP ...] [SG ...]]
The VP above did has another VP - the VP above encourage - as one of its immediate
children. In fact, this is the general pattern used to discern modal verbs from other verbs: a
modal verb has another VP as one of its children. This is a standard for modal verbs used
in both the English and Chinese Penn Treebanks. The VP above encourage, on the other
hand, does not fit this pattern. In this example, the VP above encourage would be marked
as a clausal-VP, and the VP above did would not.
Algorithm 1 Clausal Verb Identification
input: the nodes of a Chinese parse tree

procedure isClausalVerbPhrase?(node)
    if node is not a VP then
        return false  /* only targeting verb phrases */
    if one of node's children is a VP then
        return false  /* VP corresponds to a modal verb */
    for each word spanned by node, that is not spanned by a descendant VP of node do
        if word is a verb then
            if word is a VV, VC or VE then
                return true  /* any non-stative verb treated as a clausal verb */
            else
                /* need to handle stative verbs (VAs) specially */
                if word has ancestry CP → IP → [1 or more VPs] → VA then
                    /* short CP pattern: word acts more like an adjective, less like a clausal verb */
                    continue to the next word
                else if word has ancestry DVP → VP → VA then
                    /* adverbial pattern: word acts as an adverb, not a clausal verb */
                    continue to the next word
                else
                    return true  /* stative verbs are considered clausal by default */
    return false  /* default: didn't find any reason to consider node a clause root */
The problem of stage 1 is therefore restated as follows: find, for each clausal verb, the
VP ancestor that it corresponds to. Such VPs are here named clausal VPs. Sensitivity to
modal verbs is one detail of the program. The full procedure used in this thesis is outlined
in algorithm 1.
The algorithm begins with two simple tests. First of all, to be a clausal VP, a parse tree
node must of course be a VP. Secondly, the node must not have a VP as one of its children
- such a node corresponds to a modal verb. The meat of the algorithm considers each word
that is both spanned by this node and additionally not spanned by a VP descendant of
this node. (The words in the span of a node are simply those that are covered by its subtree.)
The reason for disregarding the words spanned by a descendant VP is that if any
of those words are clausal verbs, then the descendant VP is what should be recognized as
a clausal VP, rather than the current node under consideration that is higher up the tree.
Considering each word individually, the algorithm looks for a reason to mark the current
node as a clausal VP - it looks for words that behave syntactically like clausal verbs. First
of all, if a given word is the copula verb (VC) to-be, the existential verb (VE) to-have, or a
basic verb (VV), the algorithm ends and the current node is marked as a clausal VP. The
remaining class of verbs, stative verbs (VA), are treated more specially. Chinese stative verbs,
also called static verbs, are like adjectives but can also serve as the main verb in a sentence.
For example, consider the stative verb 黑, meaning dark. The sentence 天黑了 translates as
the sky (天) has become dark (黑). (The Chinese aspect word 了 denotes in this context an
emergent observation of the speaker - that the sky is just becoming dark.) In this case 黑
serves as a clausal verb. Yet in other cases, 黑 is translated simply as an adjective in English,
as in the dark coat for example.

Consider a few more examples, using the stative verbs [broad] and [organic].
The first example connects [broad] to 太平洋 [Pacific Ocean] using the DE complementizer
phrase construction. In this pattern, here named a "short complementizer phrase," the stative
verb is almost always translated into an adjective in English.
Chinese    [broad] 的 太平洋
Literal    broad DE Pacific Ocean
English    The broad Pacific Ocean
And yet in a very similar sentence, [broad] is treated like a clausal verb, and translated into
a copula+adjective:

Chinese    太平洋 很 [broad]
Literal    Pacific Ocean very broad
English    The Pacific Ocean is very broad
Chinese stative verbs can also behave like adverbs. In the following pattern, here called
an adverbial pattern, an ADVERBIAL-DE function word effectively converts a stative verb
into an adverb: organic becomes organically. (Despite being pronounced the same,
ADVERBIAL-DE 地 is a distinct Chinese character from the complementizer DE 的 used in
the first example. I therefore label it differently, by prefixing "ADVERBIAL-".)

Chinese    [organic] 地 结合起来
Literal    organic ADVERBIAL-DE combine rise come
English    Organically bring together
Any time a stative verb is encountered, the algorithm judges whether it is either part of a
short complementizer phrase pattern or an adverbial pattern. If it is, the algorithm decides
it probably isn't a clausal verb and moves on to the next word. Otherwise, the algorithm
concludes that it is a clausal verb and returns, marking the current node as a clausal VP.
The algorithm presented was designed in several iterations, inspired in part by the inspection
of a few hundred Chinese sentence parses, and in part by the annotation standards
of the Penn Chinese Treebank [14]; while the cases covered here are both imperfect and
certainly incomplete, they are the rules that I found to come into play most often.
4.2.2 Step 2: Clause Marker Propagation Control
Each clausal VP identified in step 1 corresponds to a clause subtree, with the subtree root
being some ancestor of the VP. Step 2 finds that subtree. The idea is to consider a "clause
marker" to initially be placed at each clausal VP node, and to let those markers propagate
up the tree from ancestor to ancestor, one step at a time, until the markers finally settle at
their respective clause subtree roots. Figure 4.2 illustrates this propagation: starting at the
verbs, the markers move upwards and settle at the roots of the four clauses in the sentence.
Step 2 consists of a series of rules that control when upwards propagation should happen
and when it should stop.
Algorithm 2 describes the rules in detail. At the highest level, the algorithm starts with
the list of clausal VPs identified in step 1, ordered depth-first with respect to the sentence
parse. For each VP node, considered in order, a propagateSplit routine floats the clause
marker (also referred to as a split point) up the tree, until it can propagate no further. The
canPropagateSplit? boolean procedure, the bulk of the algorithm, makes the core decision
of whether or not a given node can propagate its marker to its parent.
Algorithm 2 Clause marker propagation control.
input: splitList, the initial set of clausal VP nodes discovered in step 1, ordered depth-first

initialization:
    visitedList ← splitList.copy()
    for node in splitList do
        propagateSplit(node, visitedList)

procedure propagateSplit(node, visited)
    if canPropagateSplit?(node, visited) then
        /* do propagation: pass split marker up to parent */
        node.isSplit ← false
        node.parent.isSplit ← true
        /* add parent to the visited list */
        visited.add(node.parent)
        /* recurse on parent */
        propagateSplit(node.parent, visited)

procedure canPropagateSplit?(node, visited)
    if node is the root node of the sentence parse then
        return false  /* reached the top of the tree */
    if visited list contains node.parent then
        /* can't propagate upwards to a node that has already been propagated through */
        return false
    alwaysPropagateTypes ← {PP, CP, LCP, CCP, DNP, DVP, UCP}
    if node.parent type is in alwaysPropagateTypes then
        /* always pass to a node.parent with type in this set, regardless of node's type */
        return true
    if node type is an IP then
        if node has a BA sibling then
            return true  /* Chinese BA pattern, always pass upwards */
    if node type is a VP then
        if node.parent type is a VP or IP then
            return true  /* pass VP to VP and VP to IP */
    return false  /* default: no reason to propagate */
A visited list - the list of nodes that a marker has propagated through at some point
in time - is used throughout the algorithm to enforce one essential constraint: nodes that
have previously had a marker propagate through them cannot later allow another marker to
propagate through. The necessity of such a constraint is made apparent through a frequently
occurring pattern: Chinese parallel verb phrase constructions. It is common in Chinese to
take several verb phrases and combine them to form multi-clausal sentences. The effect is
similar to the parallel English sentence Jordan went to the grocery store, bought several loaves
of bread, waved to the pharmacist, and drove home. The Chinese syntax, similar to English,
is diagrammed to the left of the figure below:
Initial (left):  [IP [NP ] [VP [VP-SPLIT ] [VP-SPLIT ] [VP-SPLIT ]]]
Final (right):   [IP-SPLIT [NP ] [VP [VP ] [VP-SPLIT ] [VP-SPLIT ]]]
The three VPs at the base, taken together, form a single VP that becomes the predicate
of the subject noun phrase (NP) Jordan, forming a sentence (IP) at the root. The three VPs
at the base are what step 1 would identify as the clausal VPs, and where propagation would
begin. The left tree shows this initial position of the clause "split" markers.
The right-hand side of the tree shows the desired final position of the clause split markers:
the left-most marker propagates all the way up to the root IP. The remaining markers stay
where they are, effectively labeling the remaining VPs as the roots of the remaining clauses.
Such behavior is easy to achieve through the use of a visited list, because the VP directly
above the base VPs is visited once by the first base VP, which disallows the other VPs
from propagating their markers upwards.
There exists a whole host of similar parallel clause constructions in Chinese, all of which
make use of the only-propagate-once constraint. Several remaining conditions further control
how a node can pass its clause marker up to its parent. These conditions, listed in turn in
Algorithm 2, are described in order.
Firstly, there are certain node parent types, members of the alwaysPropagateTypes list in
Algorithm 2, that any node, regardless of type, will always pass its clause marker upwards
to. One such type is a prepositional phrase (PP). Consider an English example (Chinese
works similarly):
Initial:  [PP [P in] [VP-SPLIT deciding moral issues]]
Final:    [PP-SPLIT [P in] [VP deciding moral issues]]
The deciding moral issues VP forms the core of the clause. The PP directly above the
VP serves to extend its meaning by adding the function word in. It is this full fragment in
deciding moral issues that we want to capture as a clause. To generalize, the children of a
PP should always pass their marker upwards. CPs, LCPs, and the remaining members of
the alwaysPropagateTypes list serve similar functions, wrapping a function word around a
simpler clause core. The Chinese Penn Treebank documentation [14] may be consulted to
learn more about each of these types.
If a node's parent type is not a member of the alwaysPropagateTypes list, the remaining
conditions of Algorithm 2 inspect the node's type itself. At this point, if the node is an IP
(meaning that its subtree can function as a complete sentence), the split marker typically
should not be passed upwards. However, there's one exception explicitly covered in Algorithm
2: The Chinese BA pattern. An example illustrates:
Chinese    他把书放在桌子上
Literal    He BA book put on table above
English    He put the book on the table
There is a roughly equivalent way of saying this sentence that does not use BA. Semantically, the BA pattern is used to emphasize or draw attention to the object (book) of the
sentence: a more accurate translation of the above example is perhaps Please take the book
and put it on the table. The BA function word doesn't quite behave like a verb or a preposition. In fact, the Penn Chinese Treebank designers assigned BA its own part-of-speech tag
and chose a particular standard for annotating BA syntax. The tree to the left illustrates:
Before:  [VP [BA 把] [IP-SPLIT ... ]]
After:   [VP-SPLIT [BA 把] [IP ... ]]
The full clause should contain the BA function word within it. Therefore, when an IP
has a BA node as a sibling, it is necessary that the IP is able to propagate a clause marker
upwards. The tree to the right illustrates the result of such a propagation.
The final condition in Algorithm 2 tests whether the current node is a VP. Already
given that the VP's parent is not a member of the alwaysPropagateTypes list, this condition
restricts the nodes that a VP may pass upwards to: it may pass upwards only to another
VP, or to an IP. These are cases when a VP should definitely pass a clause marker upwards;
cases that are less certain or currently unforeseen are explicitly ruled out, as a conservative
measure. The default, at the bottom of Algorithm 2, is to halt propagation.
4.2.3 Design Rationale
As currently presented, clauses are picked out of a Chinese parse tree deterministically using
hard-coded rules. One design goal is therefore to represent the clause splitter as compactly
and expressively as possible, rather than making it a large, unwieldy enumeration of repetitive rules. After experimenting with several representations, I found the current technique
of first identifying clausal verbs and then controlling their propagation upwards to work
most effectively: rules of each kind - rules for clausal verb identification and
for propagation - could be expressed succinctly and extensibly. My original approach was
to try pattern matching entire clause subtrees: specifying several clause subtree patterns,
and reporting, for a given sentence, each subtree that matched one of these patterns. I
abandoned this design once I realized that, taken together, relatively small variations between different Chinese clause structures would quickly lead to a combinatorial explosion of
necessary patterns.
A second, less prominent goal of the design is to identify only the Chinese clauses that
are frequently translated into English clauses: one example already mentioned how Chinese
stative verbs can be translated as adjectives in English. Because later stages of training data
extraction will attempt to align Chinese clauses to English clauses in a parallel sentence,
it is desirable to identify only the Chinese clauses that generally map to English clauses.
Theoretically speaking, one would want to design a Chinese clause splitter that had nothing
to do with English syntax; this second goal is a practicality informed by how I use this tool.
Ruling out Chinese verbal patterns that almost always correspond to adjectives or adverbs
in English - such as short CP phrases or adverbial-DE (DVP) phrases - is therefore an
important part of the splitter.
4.3 Clause Alignment
To recap, training data extraction is the phase by which a parallel corpus of Chinese/English
sentences is processed into a training set of <Chinese clause, English AEP> pairs, one pair
per parallel clause. After parsing, clause splitting is the first step in this extraction: it takes
Chinese and English parse trees and independently splits them into clause subtrees. The
second step, clause alignment, connects these clauses - it figures out which Chinese clause
likely corresponds to which English clause, producing as output a set of Chinese/English
clause pairs. Clause alignment is reviewed in this section.
I begin with a general hypothesis of any clause-by-clause translation system, including my
own and Cowan et al.'s German-to-English translator [1] that inspired it: It is nearly always
reasonable to translate one language into another at the clausal level, keeping the number
of clauses constant. While translators often do not do so, and while it is not necessary to do
so, it is nonetheless a restriction that allows reasonable translations. For example, consider
the Chinese sentence below and two alternative English translations:
Chinese      [not can sustain] 的 [energy consumption]
Literal      not can sustain DE energy consumption
English-1    energy consumption that cannot be sustained
English-2    unsustainable energy consumption
English translation 1 is at the clausal level: the Chinese relative clause (not can sustain DE) is
translated into an English relative clause, that cannot be sustained. English translation 2
drops a clause by translating the Chinese relative clause into an adjective, unsustainable.
Both translations are good. The clause-by-clause translations my system produces, however,
would allow only translations such as English-1 that preserve the number of clauses.
This point is worth mentioning here because the training data that I extract from a
corpus - that my translator will learn from - had also better be clause-by-clause translations.
Because my system is learning how to translate a Chinese clause into an English clause, the
Chinese and English clause pairs that are input to the learning algorithm should in fact be
translations of each other.
Such 1-to-1 translations between a Chinese clause and an English clause are common in a
parallel corpus. The relative clauses in the Chinese/English-1 example pair are an example.
Clause alignment attempts to report these clauses, to be used later as training data. Other
times, also quite frequently, a Chinese clause is translated into several English clauses, or the
other way around. The Chinese/English-2 example pair is an example of that phenomenon.
Such translations cannot be well represented as pairs of aligned clauses, and should therefore
not be used as training data for a clause-by-clause translator. Clause alignment attempts to
reject such cases, dismissing them rather than incorporating them into the training data. In
a nutshell, clause alignment aims to:
1. Align Chinese and English clauses that are likely translations of each other.
2. Filter out everything else.
The next sections review in detail how I attempt this goal. The technique is similar to that
of Cowan et al. [1].
4.3.1 Method Used and Variations to Consider
To formalize clause alignment, the input of the problem is a set of clauses from a Chinese
sentence and a set of clauses from a parallel English sentence. The method described here
additionally requires a third input: word alignments for that sentence. Roughly speaking,
word alignments reveal which Chinese and English words are likely translations of each other;
alignments are similar to entries in a bilingual dictionary. Section 4.3.2 reviews how these
word alignments are generated.
Given these inputs, the output is a set of pairs of clauses, where each pair contains a
Chinese clause and the English clause that it is likely a translation of. Clauses that can't
decently be paired up are absent in this set of pairs, and thus implicitly rejected. Note that
if the sentence pair contains no conceivable parallel clauses, the output will be an empty set.
Algorithm 3 begins with clause alignment's formal inputs and outputs, and a recap of
its high-level goal. The algorithm works by finding each Chinese clause that contains only
words that align to one English clause - this is my best evidence that a clause pair is in fact a
complete clausal translation. The algorithm gets there in three steps. Firstly, if the Chinese
sentence has a differing number of clauses than the English sentence, the algorithm rejects
the entire sentence, returning an empty set of pairs. This is a conservative measure that
helps maintain confidence in the quality of the clause pairs that are extracted. If none of the
words of a Chinese clause have alignments, the clause is thrown away. Similarly for English
clauses. If a Chinese clause contains words that align to different English clauses, it and all
respective English clauses are thrown away. Similarly for English clauses. By this point, the
only clauses left are those that align 1-to-1: every Chinese clause contains words that align
only to words within one English clause, and vice versa. These 1-to-1 clause alignments are
precisely the pairs returned as output.
The rationale behind this design is a conservative one: while much potential training
data will be rejected as a result of steps 1-3, the clause pairs that are extracted will more
likely be clausal translations. Several less conservative variations of the design remain to be
explored. It might turn out that rejecting every sentence pair that has a differing number of
clauses is too conservative a measure for Chinese to English translation - the gain in quality
isn't worth the amount of data thrown away. Riskier still, it may be worthwhile to be even
less conservative by allowing clause pairs that have the majority of words aligned between
them, rather than the totality. Variations on the design at this level so far haven't been
evaluated.
The high-level description of clause alignment is complete. The question that remains,
explained in section 4.3.2, is how the word alignments of the sentence were generated in the
first place.
4.3.2 Word Alignments: Guessing Conservatively
At the core of the word alignment submodule is the word-based statistical machine translation
technology from the early 90s: the IBM models [8]. (The IBM models are described briefly in
section 2.2.1.1.) The GIZA++ toolkit [7] is a free software package for training the IBM
models. I use it to produce word alignments. Without getting into the details of the models,
or how they can be estimated, the basic idea is to look at an entire corpus of parallel Chinese
and English sentences, and to align the Chinese and English words that co-occur the most
frequently. Word alignments are one-way: IBM models align words in a source language to
words in a target language, under the assumption that the target language was probabilistically
generated from the source. A source word can be aligned to zero, one, or many target words,
but it must be the case that each target word was generated by exactly one source word.
Algorithm 3 Top-level clause alignment procedure
Goal:
    For a Chinese sentence and parallel English sentence, figure out which Chinese
    clause corresponds to which English clause. Return only the clause pairs that
    have good evidence of being translations of each other.
Input:
    The set of clauses from a Chinese sentence parse.
    The set of clauses from a parallel English sentence parse.
    The word alignments for this sentence, generated by produceContentWordAlignments.
Output:
    A potentially empty set of clause pairs. Each pair contains a Chinese clause and
    the associated English clause that it is aligned to.

1. If number(zhClauses) ≠ number(enClauses), reject this sentence entirely.
   Return an empty set of pairs.
2. If for some Chinese clause z, no words within that clause have an alignment to
   English words, remove z from zhClauses. Vice versa for English/Chinese.
3. If words within some Chinese clause z align to words within more than one
   English clause, remove z from zhClauses and all aligned clauses from enClauses.
   Vice versa for English/Chinese.

Invariant: After steps 1-3, for each clause z in zhClauses:
    - At least one word in z aligns to a word in some English clause e.
    - Each other word in clause z either also aligns to a word in clause e, or
      doesn't have an alignment.

Return:
    A set of pairs. For each Chinese clause z, return pair <z, e> for the corresponding
    English clause e. (If zhClauses is now empty, so will be the output.)
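The following sketch restates Algorithm 3 in Python. The clause and alignment representations are hypothetical: a clause is a set of word positions, and alignments are (Chinese position, English position) pairs from the intersected word alignments.

    # Sketch of the clause alignment filter of Algorithm 3 (representations
    # assumed, not the thesis's actual data structures).
    def align_clauses(zh_clauses, en_clauses, alignments):
        # Step 1: a differing number of clauses rejects the whole sentence.
        if len(zh_clauses) != len(en_clauses):
            return []

        def clause_index(pos, clauses):
            # Index of the clause containing word position pos, if any.
            for k, clause in enumerate(clauses):
                if pos in clause:
                    return k
            return None

        # Steps 2-3: collect, for each clause, the opposite-side clauses that
        # its words align into.
        zh_to_en = [set() for _ in zh_clauses]
        en_to_zh = [set() for _ in en_clauses]
        for z, e in alignments:
            i = clause_index(z, zh_clauses)
            j = clause_index(e, en_clauses)
            if i is not None and j is not None:
                zh_to_en[i].add(j)
                en_to_zh[j].add(i)

        # Keep only clean 1-to-1 clause alignments; everything else is
        # implicitly discarded.
        pairs = []
        for i, targets in enumerate(zh_to_en):
            if len(targets) == 1:
                j = next(iter(targets))
                if en_to_zh[j] == {i}:
                    pairs.append((zh_clauses[i], en_clauses[j]))
        return pairs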
[Figure: the parallel sentence pair zhongguo de qingkuang qianchawanbie / but the situation in China differs from place to place, shown with its Zh → En and En → Zh one-way word alignments and their intersection.]

Figure 4.3: One-way word alignments from Chinese to English (Zh → En) and English to Chinese (En →
Zh), intersected into a 1-to-1 alignment. The Zh → En alignment requires each English word to be aligned
to some Chinese word, often leading to spurious alignments. The reverse holds for En → Zh. Intersection
conservatively keeps only those words that are aligned both ways.
A concrete example helps to illustrate. Figure 4.3 displays a Chinese-source-to-English-target word alignment in the upper left, and a mirrored English-to-Chinese alignment in the
upper right. Consider the upper left. There is a word alignment for each target (English)
word: there are exactly 10 alignments total, for each of 10 English words. Some source
(Chinese) words align to one English word, while some align to many, but in this one-way
alignment, there must be a single word alignment for each English word. This is because the
English sentence is assumed to be generated by the Chinese sentence - more specifically, an
assumption of the model is that each English word was generated by some Chinese word.
The English-to-Chinese alignments in the upper right display an alignment going the other
way: each Chinese word is aligned to exactly one English word.
The restriction that every target word must have a source word often leads to dubious
alignments. For the purposes of clause alignment, fewer but higher-quality alignments are
more desirable. Following Cowan, I run GIZA++ in both directions and take the intersection
of the two sets of alignments, keeping only those word alignments that are present in both
directions. The bottom of figure 4.3 diagrams intersection. Note that intersection produces
1-to-1 word alignments: each Chinese word is aligned to exactly one English word, and vice
versa.
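In code, the intersection step is tiny; a sketch with alignments as sets of (Chinese position, English position) index pairs (representation assumed):

    # Intersect the two one-way GIZA++ alignments into a 1-to-1 alignment.
    # The En -> Zh alignment is flipped into Zh -> En orientation first, so
    # both sets contain (zh_pos, en_pos) pairs.
    def intersect_alignments(zh_to_en, en_to_zh):
        flipped = {(z, e) for (e, z) in en_to_zh}
        return zh_to_en & flipped   # keep links present in both directions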
So far, the algorithm used within clause alignment has just about been fully described:
separately run GIZA++ in each direction and return the intersection. There is one additional sophistication reviewed here. Before running GIZA++, I filter the corpus of parallel
sentences that it reads to contain content words only. A content word is a noun, verb, adjective or adverb by my definition. The rationale is that in clause alignment - figuring out
which Chinese clause is a translation of which English clause - I care that the content of the
clauses is the same on both sides: the main verbs of the clauses are aligned, the subject and
object nouns are somehow aligned, and so on. Function words often have shakier alignments
between languages, providing less to depend on. In the end, all I want to be convinced of
is that the clause pairs aligned are in fact valid translations. Content word alignments are
more dependable, and sufficient for estimating whether one clause is a translation of another.
Algorithm 4 outlines the full word alignment procedure.
4.4 AEP Extraction
The ultimate goal of this thesis is to incorporate clause structure - and the way it differs
between Chinese and English - into a statistical model for Chinese-to-English machine
translation. Following Cowan et al. [1], the Aligned Extended Projection (AEP) is the formal
model that I adopted for this task. The skill to be learned becomes how to predict an English AEP given a Chinese clause: a learning algorithm inspects a body of <Chinese clause,
English AEP> input/output training examples, and attempts to learn a mapping function
from Chinese clauses to English AEPs.
Algorithm 4 Clause alignment helper procedure: produceContentWordAlignments
Input:
    A corpus of Chinese sentences and parallel English sentences
Output:
    For each sentence pair, a 1-to-1 alignment between a subset of its Chinese and
    English words.

1. Filter Chinese and English sentences, keeping content words and removing
   everything else. A content word is a noun, verb, adjective or adverb by my definition.
2. With the filtered corpus as input, produce Chinese to English (Zh → En) and English
   to Chinese (En → Zh) word alignments using the GIZA++ toolkit, IBM model 4.
3. Take the intersection of the Zh → En and En → Zh word alignments to produce
   a single, 1-to-1 alignment between a subset of the Chinese and English content
   words.

Return:
    The intersection alignment for each sentence.
Before any learning can begin, a body of training examples must exist. Clause alignment,
the previous stage of training data extraction, produced a parallel corpus of <Chinese clause,
English clause> translations.
AEP extraction is the final stage: from each raw pair, it
extracts an English AEP, yielding finally a <Chinese clause, English AEP> training example
ready for the learning algorithm.
AEP field    Description
STEM         Clause's main verb, stemmed.
SPINE        Syntactic tree fragment capturing clausal argument structure.
WH           A wh-phrase string, such as in which, if present in the clause.
MODALS       List of modal verbs, such as must and can, if present.
SUBJECT      The Chinese modifier that becomes the English subject; or NULL if the
             English clause has no subject; or a fixed English string (e.g. there) if the
             English subject doesn't correspond to a Chinese modifier.
OBJECT       Analogous to SUBJECT.
MOD[1...n]   Alignment positions for the n Chinese modifier phrases that have not already
             been assigned to SUBJECT or OBJECT.

Table 4.2: AEP data structure. The first four fields reflect the English clause only. The remaining fields
connect the English clause to the Chinese clause that it was translated from.
AEPs and their related terminology are conceptually reviewed at length in Chapter 3.
Table 4.2 is a schematic of the actual AEP data structure extracted in this stage and
eventually predicted during translation. To summarize, an AEP models the translation of
a Chinese clause into an English clause. Part of the AEP, fields 1-4 in table 4.2, describes
the English clause's anatomy. The remaining fields connect the English clause to the Chinese
clause that it was translated from: which parts of the Chinese clause became the English
subject and object, and how the remaining parts became positioned when translated into
English. Section 4.4.1 explains how fields 1-4 are extracted. Section 4.4.2 covers fields 5-7.
Algorithm 5 sketches out the complete extraction at a high level.
4.4.1 Identifying the Pieces of an English Clause
The first half of AEP extraction looks only at the English clause, to identify all of its relevant
pieces: its main and modal verbs, subject, object, and adjuncts, its function words, and its
high-level syntactic spine. Some of these pieces to be identified are illustrated in the following:
but [function word] | the research climate [subject] | must [modal verb] | in some sense [adjunct] | change [main verb] | over the next few years [adjunct]
The input that the rules of this stage analyze to make these identifications is an English
clause subtree, such as the one below.
[CCP [CC but] [S [NP-A the research climate] [VP [V must] [VP [PP in some sense] [V change] [PP over the next few years]]]]]
This particular example clause will be used in subsequent sections to illustrate AEP
extraction step by step. It takes change as its main verb, and has a single modal verb must.
Algorithm 5 AEP Extraction
Goal:
The over-arching purpose of this system is to learn how to predict an English AEP
from a Chinese clause. A learning algorithm attempts this task by inspecting a
corpus of <Chinese clause, English AEP> input/output pairs. Clause alignment,
the previous step of training data extraction, produces a corpus of <Chinese
clause, English clause> translation pairs. This final algorithm processes these
pairs into the AEP formalism that I aim to learn.
Input:
<Chinese clause, English clause> subtree pair, and word alignments
Output:
<Chinese clause, English AEP> training example
1. Identify English Clause Anatomy
(a) Identify main verb
(b) Identify modal verbs
(c) Identify clause spine structure, including subject and object attachment
(d) Identify adjuncts
2. Identify Chinese Modifiers
- Same technique as in step 1, but on Chinese side.
- Only need the list of Chinese modifiers (includes subject, object and adjuncts)
3. Align Chinese Clause to English Clause
(a) Find the Chinese modifier that best aligns to the English subject, if any.
(b) Find the Chinese modifier that best aligns to the English object, if any.
(c) For each remaining Chinese modifier z, determine the position of the best aligned
English modifier e:
i. Find the best e using z's word alignments
    ii. Report e's position in the English clause. 5 possibilities:
        - PRE_SUBJECT: before subject
        - POST_SUBJECT: between subject and first modal
        - IN_MODALS: between first modal and main verb
        - POST_VERB: after main verb
        - DELETED: if no aligned English modifier e could be found
Its subject is the research climate, and it has no object because change is intransitive. Two
adjuncts, prepositional phrases in this case, further add information to the clause: in some
sense and over the next few years. These are adjuncts and not arguments (subject or object)
because the clause would still be grammatical were they to be removed. Lastly, because this
particular clause was taken from a conjunction, it incorporates the function word but at the
top of the tree, giving some picture of how it can be combined with other clauses.
As Algorithm 5 illustrates, these pieces are identified in a fixed order. For example,
modal verb identification requires the main verb to have already been identified. Each step
is reviewed in order.
4.4.1.1 Main and Modal Verb Identification
The first two steps identify the main verb and a possibly empty series of modal verbs,
respectively. The figure below illustrates the result of these identifications:
[Figure: the example clause tree, with must identified as a modal verb and change identified as the main verb.]
Main verb identification is governed by a few rules. If there is only one verb in the clause,
it is identified as the main verb. If there is a verb in the subtree located at a shallower depth
than another verb, it is dismissed as a possibility (must is dismissed because it is shallower
than change, for example). Ignoring the details, a few Penn Treebank-dependent rules come
into play to further filter the main verb candidates based on their ancestors. In the end, if
more than one verb candidate still exists, the right-most verb of the clause is chosen because
modal verbs occur before the main verb.
Modal verb identification is similar. The algorithm marks as a modal verb
every verb v
other than the main verb, subject to two restrictions: (1) v occurs before
the main verb, and
(2) v and the main verb exist within the same verb phrase. Consider
must in the example
above. Restriction (1) is clearly met. Restriction (2) is also met because
the VP parent of
must also spans the main verb.
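A sketch of these two restrictions in code; the word-node accessors are hypothetical:

    # Mark as modal every verb occurring before the main verb whose enclosing
    # verb phrase also spans the main verb. `verbs` lists the clause's verb
    # nodes in sentence order; accessors are hypothetical.
    def find_modals(verbs, main_verb):
        modals = []
        for v in verbs:
            if v is main_verb:
                break                                   # restriction (1)
            if main_verb in v.parent.spanned_words():   # restriction (2)
                modals.append(v)
        return modals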
4.4.1.2 Spine Extraction and Subject/Object Detection
After main and modal verb identification, the next step is to extract the syntactic
spine, or
backbone, of the clause. The spine reflects the essential argument structure
of the clause:
how the subject and object attach. It also illustrates how the main verb projects
function
words around itself, such as but in this example. The spine does not include any
adjuncts
that might have been present in the original clause; it is a high-level
simplification. Similarly,
it does not include modal verb structure. Note that both modals and adjuncts are
modeled
elsewhere.
The following illustrates an example of spine extraction:
Full clause:
[CCP [CC but] [S [NP-A the research climate] [VP [V must] [VP [PP in some sense] [V change] [PP over the next few years]]]]]

Extracted spine, with subject attachment point:
[CCP [CC but] [S [NP-A ] [VP [V ]]]]
The subject of the clause, the research climate, is identified and reduced to its root,
an NP-A. (NP-A stands for Noun Phrase Argument; the Collins parser identifies these clausal
arguments.) That the spine does not have an NP-A attachment point after the main verb
indicates that change is an intransitive verb. No trace of the adjuncts can be seen on the
spine. Lastly, the modal verb syntax is dropped, leaving only the position of the main verb
V. In general, modal verb syntax is reduced in the following way:
[VP [V modal-1] [VP [V modal-2] ... [VP [V main]]]]  ⇒  [VP [V main]]
The rules that control spine extraction work in several steps. First, the spine is initialized
to the ancestral line of the main verb: CCP → S → VP → VP → V in this case. Next the
spine is "grown" to include function words that branch directly off of the initial spine. Subject
and object attachment points are added to the spine by looking for NP-A nodes that connect
directly to the current spine. Finally, modal verb syntax is removed if present, leaving only
the attachment of the main verb. The rules are slightly complicated by treebank-dependent
details not mentioned here.
Item 3 in Table 4.2 - the WH field - might seem like an odd entity. It is an explicit part
of the AEP because English WH words (who, what, when, etc) are so prominent in English
clauses. WH words don't quite behave like subject and object, and don't quite behave like
function words, so they become their own field. Spine extraction additionally identifies these
words, much like it identifies function words.
4.4.1.3 Adjunct Identification
The final step produces a list of English adjuncts in the order that they appear. In this case,
the adjuncts are in some sense and over the next few years:
[Figure: the example clause tree, with the adjuncts in some sense and over the next few years marked.]
While English adjuncts aren't present anywhere in Table 4.2, they are used in section
4.4.2 to help fill out fields 5-7. Adjunct identification works simply: take each child of each
spine node, and identify it as an adjunct if it isn't already considered some other part of the
clause structure.
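A sketch of that rule, with hypothetical node accessors:

    # Adjunct identification: every child of a spine node that is not already
    # part of the identified clause structure (verbs, subject, object, spine
    # nodes) is reported as an adjunct, in left-to-right order.
    def find_adjuncts(spine_nodes, known_structure):
        adjuncts = []
        for node in spine_nodes:
            for child in node.children:
                if child not in known_structure:
                    adjuncts.append(child)
        return adjuncts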
4.4.2 Connecting Chinese and English Clause Structure
The first half of an AEP is strictly on the English side: the main verb, the clause backbone
structure, and so on. The remainder of the representation, described here, is the alignment
information to the Chinese clause that it is a translation of: which parts of the Chinese
clause became the English subject and object, and the positions that the other parts moved
to when translated.
More formally, I wish to track the movement of the Chinese clause's modifiers: its arguments (subject/object) and adjuncts. The information that I wish to extract is described
below by example:
[Figure: the English clause tree paired with the four modifiers of the source Chinese clause, with their alignments:
Chinese Modifier 1 → alignment position POST_VERB (the aligned English modifier is an adjunct that appears after the main verb);
Chinese Modifier 2 → becomes the English subject;
Chinese Modifier 3 → alignment position IN_MODALS (the aligned English modifier is an adjunct between modal and main verb);
Chinese Modifier 4 → alignment position DELETED (no modifier counterpart in the English clause).]
In this example, the Chinese sentence has four modifiers. The second modifier becomes
the English subject when translated. The translation of the first modifier - a modifier near
the beginning of the Chinese sentence - is positioned at the end of the sentence on the
English side. The third modifier, between the Chinese modal and main verb, translates
to an English modifier at the same position. The final modifier has no correspondence in
English: it is deleted from the English translation.
The reason this stage is non-trivial is that Chinese and English clauses can be structured quite differently from each other. In this example, the English translation uses the
intransitive verb change while the Chinese uses the verb-object combination occur changes.
This leads to a modifier on the Chinese side - the object - that corresponds to no English
modifier, being incorporated instead into the English main verb. A Chinese modifier near
the beginning of the sentence moves to the very end on the English side. While not pictured
here, the Chinese subject is not always the English subject - it may be the case that a Chinese adjunct becomes the subject, and the Chinese subject drops off or becomes an English
adjunct. Similarly for objects.
Steps 2-3 in Algorithm 5 illustrate how this Chinese modifier movement information is
actually extracted. The Chinese modifiers must first be identified to begin with. Step 2 does
this through an analysis just like that of Step 1, but on the Chinese side. Step 3 uses word
alignments between Chinese modifiers and English modifiers to make its decisions. For a
given English modifier, the best aligned Chinese modifier is defined to be that which has the
greatest number of aligned words to the English modifier.
First, step 3 finds which Chinese modifier best aligns to the English subject, and fills
out the subject field of Table 4.2. If the English clause has no subject attachment point, it
assigns NULL to the subject field. If the English subject has no correspondence to a Chinese
modifier, the field is assigned that English subject string rather than an alignment
pointer. For example, English often uses the subject there, as in there exist or there are,
for which there will be no corresponding Chinese modifier. Step 3 fills in the object field
completely analogously to the subject field.
Lastly, for each remaining Chinese modifier that wasn't aligned to the English subject
or object, step 3 first (1) finds the best aligned English modifier, and then (2) records
the position of that English modifier in the English clause. These positions fill out the
MOD[1...n] array of Table 4.2. Five positions are possible: PRE_SUBJECT if the aligned
English modifier is before the English subject, POST_SUBJECT if the modifier is after
the subject but before the first modal verb, IN_MODALS if it is between the first modal
and the main verb, POST_VERB if it is after the main verb, or DELETED if there is
no correspondence to an English modifier and the Chinese modifier is dropped from the
translation.
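A sketch of this position classification over word indices in the English clause (all parameter names hypothetical):

    # Classify an aligned English modifier's position relative to the subject,
    # first modal, and main verb of its clause, using word indices. A start
    # index of None means no aligned English modifier was found.
    def modifier_position(mod_start, subject_start, first_modal, main_verb):
        if mod_start is None:
            return "DELETED"
        if subject_start is not None and mod_start < subject_start:
            return "PRE_SUBJECT"
        # With no modals, the POST_SUBJECT region extends to the main verb.
        boundary = first_modal if first_modal is not None else main_verb
        if mod_start < boundary:
            return "POST_SUBJECT"
        if mod_start < main_verb:
            return "IN_MODALS"   # between first modal and main verb
        return "POST_VERB"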
A Note about Word Alignments
Recall that clause alignment used the intersection between Zh → En and En → Zh word
alignments as part of its input. This was a conservative measure: the intersection improves
quality at the cost of fewer alignments between words. I initially used intersected word
alignments for AEP extraction as well, but quickly found intersected alignments to be too
conservative. Far too often, a Chinese modifier had to be DELETED because it had no
words at all that aligned to the English clause. By using En → Zh word alignments instead,
such that every Chinese word has some alignment to an English word, I was able to cut
down on the rate of modifier deletion, likely at the cost of some added noise.
Intersected alignments work well in clause alignment because a clause spans a relatively
large number of words. The more words there are, the more likely there is to be at least
a single alignment, which is all that is needed to make an association. On the other hand,
intersected alignments work much more poorly when trying to line up Chinese and English
modifiers, because each modifier often spans only a few words. It is more likely that none of
those words will have alignments.
Chapter 5
Experiments
Chapter 4 described the system and its parts in detail. Chapter 5 probes this system through
experiments to evaluate its performance. In short, while further work, outlined in 5.3, must
be done before overall performance of this system could potentially be competitive, some
intermediate results prove hopeful. At a clause-by-clause level, the system predicts AEP
structure quite promisingly. An experiment supports this claim in section 5.1, with sample
output shown and discussed in section 5.2. It is when (1) the many independently translated
clauses of a sentence are reassembled, and (2) Chinese clausal modifiers are translated and
inserted back into their clause - orthogonal problems not focused on in this thesis - that
performance really suffers. Part of the problem lies in the overly simplistic gluing algorithm
used to assemble translated clauses. An extension that might remedy this problem is described in 5.3.2. Part of the problem lies in agreement between the different (independently
translated) clauses of a sentence, including tense, person and number agreement. Another
equally severe shortcoming lies in agreement between each modifier (recall that each modifier
is independently translated using a phrase-based system) and its surrounding
clausal context. It is certainly a large problem that the clauses of a sentence currently don't
inform the translation of each other, and that clausal context doesn't influence modifier
translation. An extension for combatting these sorts of contextual problems is described in
section 5.3.1.
5.1 AEP Prediction Accuracy
This thesis focuses on predicting English clause structure from Chinese clause structure
through the Aligned Extended Projection representation.
My first experiment therefore
measures the accuracy in predicting these AEP structures, controlling as much as possible
for corrupting factors. Chinese parses, for example, are frequently erroneous at the
clausal level - and frequently the misparse stems from segmentation errors made earlier.
But bad parses are disregarded in this analysis because the desired measurement is AEP
prediction accuracy for good input.
I inspected 150 clauses from the test data by hand. Of those clauses, 55 were misparses
that were disregarded. I used the remaining 95 to estimate predictive accuracy for three
AEP fields: main verb, spine and alignment prediction. The results are shown below.
AEP Field     % Accuracy
Main Verb     80%
Spine         72%
Alignments    88%
For each of the 95 clauses, if the main verb was correct within that clause, I marked the
main verb prediction as accurate. I was conservative in the markings; if it was a stretch to
consider the verb correct, I marked it wrong. For example, one main verb prediction was
pays where it should have been watching, the full phrase being watching each step. "pays"
(as in paying attention) and "watching" are semantically similar, but pays is clearly incorrect
for this clause. More commonly, incorrect main verb predictions reduced specific verbs such
as change and make to commonly predicted verbs such as is. The following
table lists the 95 English main verb predictions, next to the corresponding Chinese main
verb. Note that any feature - not just features of the Chinese main verb - is used by the
model to help predict the English main verb.
[Table: the 95 English main verb predictions, shown next to their corresponding Chinese
main verbs, with corrections where the prediction was wrong. The Chinese characters are
not legible in this copy. The English predictions were: held, have, get, continue, taken, is,
has, depends, stipulates, protecting, taking, existing, is, stressed, strengthen, face, suffer,
raise, create, said, brought, begun, lost, pays, announced, build, is, worked, meet, remain,
realize, restored, dedicated, building, wearing, guarantees, is, is, ended, adopted, promote,
exchange, increases, is, expected, think, give, made, accelerate, establishing, developing,
take, ask, bringing, combining, is, gave, hope, supporting, hopes, issuing, stressed,
have-opportunity, occurred, danced, unfolded, watching, transform, committed, shows,
develop, has, prevent, explains, carrying, make/cause, predict, give-play-to, to serve as,
evaluate, is, said, said, emphasizing, join, said, pointed, established, building, said, make,
wrote, joining, stood, is, suspended, said, arrived, expressed, reach, enrolled, promote, has,
I joined, participate, maintained, establish, has, conduct, ensure. Corrections included
haven't and free [oneself from].]
I marked any spine prediction as wrong if there was some part of the spine that seemed
incorrect. For example, if the Chinese clause had a DE complementizer function word, the
English spine should have reflected that accordingly. Surprisingly, the largest source of error
in the spine predictions was the insertion of an and conjunction for Chinese clause inputs
that didn't seem to have any such conjunction. This is likely because Chinese often combines
several parallel clauses without conjunctions: during training, even though a given Chinese
clause itself may have no conjunction, the aligned English clause frequently will.
Each clause contains potentially many modifier alignments. The statistic reported treats
each alignment prediction independently. Out of the 95 well-parsed clauses, there were 177
modifier alignments in all. If a Chinese modifier migrated to any correct location on the
English side, I reported it as correct.
It is worth noting that the spine and alignment accuracies are slightly misleading, because
the predictions are frequently the common subject-verb-object spine, with Chinese subject →
English subject and Chinese object → English object alignments. These default predictions
are often the right ones, and I'm not sure how often the system lucks out this way, guessing
the default without being well-sensitized to the input example. Nevertheless, given good
input, AEP prediction tends to perform well.
5.2 System Output Walk-Through
From the 150 test clauses evaluated, I selected a few examples to highlight the strengths
and weaknesses of my system. Each example shows a print-out from my code. The AEP
prediction is at the top: main verb, spine, subject, object, wh words, modal verb string, and
Chinese modifier alignments. The input Chinese clause that it was predicted from is shown
next. The final string at the bottom shows the full AEP prediction "assembled": it puts
each AEP field together to show the final predicted clausal translation, with the Chinese
modifiers left untranslated.² Consider the first example:
²Recall that Chinese modifiers are each translated independently in a later step, using an off-the-shelf
phrase-based translator.
main verb: promote
spine:
S NP-A
VP V
NP-A
subject: "we"
object: Chinese node 19
wh: ""
modals string: "should"
alignment:
Chinese node 2: POST_VERB
Chinese node 13: IN_MODALS
Chinese node 19: already assigned to object
Chinese clause input:
IP-0 VP-1 QP-2 NP-3 NPB-4 NT-5 [future]
          QP-6 CD-7 [five]
          CLP-8 M-9 [years]
     VP-A-10 VV-11 [want]
     VP-A-12 DVP-13 VP-14 VA-15 [unswervingly]
             DEV-16 [DE-ADVERBIAL]
        VP-A-17 VV-18 [promote, push forward]
        NP-A-19 NPB-20 NN-21 [revolution]
output:
we should <untranslated Zh node 13> promote <untranslated Zh node 19> <untranslated Zh node 2>
literal:
we should [unswervingly] promote [the revolution] [(for) the next 5 years]
What's exciting about this example is that some of the discrepancies between Chinese and
English clause structure are well accounted for. The AEP correctly predicts the modal/main
verb sequence should promote, with the correct tense, likely as a result of features identifying
the Chinese modal verb glossed [want] above, which literally means to want but often means
should in the appropriate context. More interestingly, the Chinese clause has no subject - it
has been dropped, likely because it was already made clear in an earlier sentence. To cover
this, the prediction correctly creates an English subject we that has no correspondence on
the Chinese side. The Chinese modifier meaning (for) the next 5 years is moved from near
the beginning of the clause to the end, which is perfectly correct.³ The adverbial Chinese
modifier meaning unswervingly is aligned between the English modal verb and main verb
(the IN_MODALS position), which is the only place it could correctly have gone.
It is certainly possible that any subset of these predictions was merely lucky; this
demonstration is not statistically compelling. However, I suspect that in the training data,
the Chinese verb meaning to promote often appears without a subject and with the unaligned
English subject we, because there the verb is often used by politicians speaking to the
audience inclusively. Similarly, because it is uncommon in the examined test output to
assign a Chinese modifier to the IN_MODALS position, I suspect that the trained model
is well-sensitized to adverbial modifier positioning. It is quite common in these examples
for a pre-subject Chinese modifier to move to the post-verb position on the English side.
The (for) the next 5 years modifier is one example. The next example illustrates another,
in which such movement is the only correct translation:
³Both positions - beginning or end of clause - are acceptable in English.
main verb: bringing
spine:
PP IN for
VP V
NP-A
subject: NULL
object: Chinese node 11
wh: ""
modals string:
alignment:
Chinese node 4: POST_VERB
Chinese node 11: already assigned to object
Chinese clause input:
CP-0 IP-1 VP-2 PP-3 P-4 [to]
          NP-A-5 NPB-6 NN-7 [rural]
                 NN-8 [landscape]
     VP-A-9 VV-10 [bring-about]
     NP-A-11 ADJP-12 JJ-13 [tremendous]
             NPB-14 NN-15 [changes]
     DEC-16 [DE]
output:
for bringing <untranslated Zh node 11> <untranslated Zh node 4>
literal:
for bringing [tremendous changes] [to rural landscape]
The Chinese modifier rooted at node 4, a prepositional phrase meaning to the rural
landscape at the beginning of the clause, correctly moves to the POST_VERB position,
changing the original Chinese word order from [to rural landscape] bring-about tremendous
changes to the more English-like bring-about tremendous changes [to rural landscape]. Also
interesting about this example is that the input is a relative clause, as can be seen by the
presence of the DE function word at the end. This information is incorporated into the spine
prediction, which includes a complementizer function word for at its beginning. Lastly, the
tense of the main verb bringing is likely the only correct tense in a relative clause such as
this: for bringing tremendous changes ...
Again, it is possible that any of these predictions are lucky. Because the spine above is
rarely predicted, however, I suspect that the model is sensitized to the presence of DE at
the end of a Chinese clause.
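Before moving on, the sketch below shows roughly how the "assembled" string at the bottom of each print-out can be produced from the AEP fields. It is a deliberate simplification - the real assembly consults the spine - and the AEP class and field names here are illustrative only, not those of my implementation.

# Illustrative sketch of assembling a predicted AEP into a clause string.
# The fields mirror the print-outs above; the ordering logic is simplified
# (real spines can place constituents differently).

from dataclasses import dataclass, field

@dataclass
class AEP:
    main_verb: str
    subject: str                     # a translation, or an untranslated Zh modifier
    obj: str = ""                    # empty when the AEP has no object (NULL)
    modals: str = ""
    in_modals: list = field(default_factory=list)   # IN_MODALS slot
    post_verb: list = field(default_factory=list)   # POST_VERB slot

def assemble(aep):
    """Flatten the AEP fields into the clause string shown at the bottom
    of each print-out (spine details are ignored in this simplification)."""
    parts = [aep.subject]
    if aep.modals:
        parts.append(aep.modals)
    parts += aep.in_modals           # between the modal string and the verb
    parts.append(aep.main_verb)
    if aep.obj:
        parts.append(aep.obj)
    parts += aep.post_verb           # after the verb and object
    return " ".join(parts)

# Mirrors the first example, with Zh modifiers left untranslated:
aep = AEP(main_verb="promote", subject="we", obj="<Zh node 19>",
          modals="should", in_modals=["<Zh node 13>"], post_verb=["<Zh node 2>"])
print(assemble(aep))  # we should <Zh node 13> promote <Zh node 19> <Zh node 2>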
The next example illustrates both a strength and, separately, a weakness.
main verb: maintained
spine:
S NP-A
VP V
NP-A
subject: Chinese node 4
object: Chinese node 20
wh: ""
modals string: "have"
alignment:
Chinese node 1: DELETED
Chinese node 4: already aligned to subject
Chinese node 20: already aligned to object
Chinese clause input:
IP-0 ADVP-1 AD-2 [at the same time]
     PU-3 ,
     NP-A-4 NP-5 NPB-6 NR-7 [Asia]
            CCP-8 CC-9 [and]
            NP-10 NPB-11 NN-12 [world]
            DP-13 DT-14 [other]
            NPB-15 NN-16 [places]
     VP-17 VV-18 [protect]
           AS-19 [ZHE]
     NP-A-20 CP-21 IP-22 VP-23 VA-24 [close]
             DEC-25 [DE]
             NPB-26 NN-27 [connection]
output:
<untranslated Zh node 4> have maintained <untranslated Zh node 20>
literal:
[Asia and other places in the world] have maintained [close connections]
In this example, the modal/main verb sequence have maintained is dead-on. While
Chinese clauses often give no indication of the correct tense of the verb, the occasional
presence of an aspect function word can give clues. This Chinese example clause contains
the continuous aspect word (glossed [ZHE] above) after the main verb, indicating continual
duration of an action or state. Features in the model identify these aspect words, perhaps
helping to make the
correct main verb prediction. That the verb is additionally translated as past-tense in English
is likely a result of the training corpus consisting of news articles, most of which are reporting
on past events.
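The sketch below illustrates the flavor of such features: indicator features pairing a Chinese aspect word with a candidate English verb form. The feature strings, the ASPECT_WORDS table (shown in pinyin), and the extraction function are hypothetical stand-ins, not the actual feature set of the model.

# Hypothetical sketch of indicator features tying Chinese aspect words to
# English verb form, in the spirit of the discussion above. Feature names
# and the ASPECT_WORDS table are illustrative, not the thesis feature set.

ASPECT_WORDS = {"le": "PERFECTIVE", "zhe": "CONTINUOUS", "guo": "EXPERIENTIAL"}

def aspect_features(zh_tokens, zh_main_verb_index, candidate_en_verb_tag):
    """Emit (feature, value) indicator pairs for a candidate English verb
    form, looking at the token immediately after the Chinese main verb,
    where aspect markers appear."""
    feats = []
    for tok in zh_tokens[zh_main_verb_index + 1 : zh_main_verb_index + 2]:
        aspect = ASPECT_WORDS.get(tok)
        if aspect:
            # e.g. CONTINUOUS aspect co-occurring with an English participle
            feats.append((f"aspect={aspect}&en_tag={candidate_en_verb_tag}", 1))
    return feats

# Toy pinyin example: a verb followed by the continuous marker "zhe"
print(aspect_features(["weichi", "zhe", "lianxi"], 0, "VBN"))
# [('aspect=CONTINUOUS&en_tag=VBN', 1)]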
That the example deletes a modifier - the modifier at root node 1 - also shows a weakness
of the system. Oftentimes, Chinese modifiers should be deleted from the English side; one
example is a Chinese adverb with no clear correspondence in English. More frequently,
however, a modifier is deleted during training data extraction because it has no word
alignments to English modifiers. In general, because of word alignment troubles, modifiers
are deleted too frequently in the training data, leading to a slight paraphrasing problem
that my system suffers from. In this example, the modifier is a 4-character, single-token
idiom that roughly means at the same time. It should not be deleted.
The final example is an instance of a common failure mode of the system: incorrect
POST_VERB alignment predictions. As shown in the previous three examples, it is common for a pre-subject Chinese modifier to move to the post-verb position in the English
translation. Temporal and prepositional phrases are the most common instances. In this
example, however, POST_VERB alignment ends up ruining the word order:
main verb: ensure
spine:
S NP-A
VP V
subject: Chinese node 4
object: NULL
wh: ""
modals string: "will"
alignment:
Chinese node 1: DELETED
Chinese node 4: already assigned to subject
Chinese node 6: POST_VERB
Chinese node 35: POST_VERB
Chinese clause input:
[Parse omitted: the Chinese characters are not legible in this copy. The relevant nodes are
the adverbial modifiers ADVP-1 and ADVP-4 (AD-5), the noun phrase NP-A-6, the main
verb VP-33 VV-34, the noun phrase NP-A-35, and IP-47, a clause attachment point at the
end of the clause.]
output:
<untranslated Zh node 4> will ensure <untranslated Zh node 6> <untranslated Zh node 35>
literal:
[is] will ensure [online news and information publisher]
[this sort high degree DE timeliness and net news speed competition]
At the end of the clause is an attachment position for a lower clause, with roughly the
meaning make the news publisher [lower clause]. This is reorganized as make news publisher
<long modifier 1> <long modifier 2> [lower clause], obscuring the connection between the
news publisher and its lower clause. The subject of the sentence is also incorrectly identified
as a Chinese single-word adverbial modifier, further contributing to a nonsensical result.
5.3 Further Work
While AEP prediction itself has shown promising results, the system is currently far from
being competitive with the state of the art. Substantial work in designing a modifier insertion strategy and a clause gluing strategy is necessary before it could be used to translate
sentences. Sections 5.3.1 and 5.3.2 review further work in these areas, respectively. The
system is also naturally quite sensitive to Chinese parsing and segmentation errors, as these
technologies lie at its foundation. Section 5.3.3 summarizes this issue.
5.3.1 The Need for a Modifier Insertion Strategy
Recall that clausal modifiers - subject, object and adjuncts - are each translated
independently using a phrase-based translator. The strength of this approach is that
phrase-based systems, while ineffective at long-distance word reordering, tend to do an
excellent job of translating content correctly and learning the local relationships between
words. A weakness, however, is that it is hard to make the independently translated
modifiers agree with their outer clausal context. Consider two similar sentence translations,
with modifiers delimited by square brackets:
[talks] propelled [in the economic and technological cooperation between China and
Japan] and stimulated [future collaboration], reaching [new level].
[the talks] propelled [economic and technological cooperation between China and
Japan] and stimulated [future collaboration], reaching [a new level].
In the first translation, the modifiers for the most part do not agree with their outer clausal
context: in the economic and technological cooperation between China and Japan, for
example, is a perfectly fluent noun phrase, but an invalid object for the verb propelled. The
second translation has similar modifiers that additionally fit within their clauses.
Section 4.1.4 describes the current modifier insertion placeholder implementation that
simply inserts the best modifier translation without regard to how it fits within its clause.
In their German-to-English system, Cowan et al. [1] took the n-best list of each modifier
translation and used a discriminative reranking algorithm [11] to choose between them.
Other approaches are certainly possible. In her 2008 dissertation, Cowan derives an improved
approach based on finite state machines for effectively choosing modifier translations.
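The sketch below illustrates the n-best selection idea in its simplest form, with a toy clause-level scoring function standing in for a real reranker such as [11]; the clause template, the scorer, and all names here are illustrative assumptions, not the thesis or Cowan's implementation.

# Minimal sketch of choosing among n-best modifier translations by scoring
# each candidate within its surrounding clause, rather than inserting the
# 1-best translation blindly. score_in_context stands in for a real model
# (e.g. a discriminative reranker or a language model); it is illustrative.

from itertools import product

def glue(template, modifiers):
    """Fill the clause template's slots with one candidate per modifier."""
    return template.format(*modifiers)

def choose_modifiers(template, nbest_lists, score_in_context):
    """Exhaustively score every combination of candidates (fine for the
    short n-best lists and few slots of a single clause)."""
    return max((glue(template, combo) for combo in product(*nbest_lists)),
               key=score_in_context)

# Toy scorer: prefer clauses where 'propelled' takes a bare noun-phrase object.
def toy_score(clause):
    return -1.0 if "propelled in" in clause else 1.0

nbest = [["the talks", "talks"],
         ["in the economic and technological cooperation between China and Japan",
          "economic and technological cooperation between China and Japan"]]
print(choose_modifiers("{0} propelled {1}", nbest, toy_score))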
5.3.2 The Need for a Clause Gluing Strategy
A substantial problem that has not been the focus of this thesis is the following: after the
clauses of a Chinese sentence have each been independently translated into English, how
should they be reassembled (glued) to form the full sentence? In German, clauses often
attach from left to right, making them easy to combine: simply concatenate them. Chinese,
on the other hand, is not so easy - its sentences much more readily have embedded clause
structure. Consider the following Chinese clause as an example:
[Parse omitted: the Chinese characters are not legible in this copy. The relevant structure
is a clause attachment point NP-A-16 CP-17 modifying the subject noun NPB-18, and a
second clause attachment point NP-A-25 CP-26 modifying the object noun NPB-27, under
the main verb VP-A-23 VV-24.]
It has two places where lower clauses attach: one that modifies the subject, and one that
modifies the object. In the translation into English, putting this high-level clause and
its two subclauses together is not a straightforward task. A simple strategy such as
left-to-right concatenation clearly won't work. The placeholder implementation - depth-first
clause gluing, described briefly in 4.1.4 - will fail just as badly on this example.
Chinese subject and object relative clauses are the most common source of this nested
behavior. One might think that an easy improvement would be to detect such cases, and
attach the translated relative clauses to the English subject and object accordingly. The
problem is that the Chinese subject and object don't always correspond to the English
subject and object. More thought is needed.
One possible extension would be to augment the AEP formalism to include subclause
attachment positions, such that the model explicitly predicts the places on an English clause
where translated English subclauses attach. A rich set of features for analyzing subclause
attachment positions could easily be added to the model.
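The sketch below illustrates what gluing through explicitly predicted attachment points might look like, assuming each translated clause is a template whose numbered slots name where its translated subclauses attach. The Clause class, the slot syntax, and the toy sentence are hypothetical; this is the shape of the proposed extension, not an implementation of it.

# Sketch of gluing via explicit attachment points: each translated clause is
# a template whose slots mark where subclauses attach, so embedded Chinese
# clause structure survives reassembly instead of being concatenated blindly.

from dataclasses import dataclass, field

@dataclass
class Clause:
    template: str                          # "{0}", "{1}", ... mark attachment points
    children: list = field(default_factory=list)

def glue(clause):
    """Depth-first substitution of each translated subclause into the slot
    predicted for it."""
    return clause.template.format(*(glue(c) for c in clause.children))

# Same shape as the example above: one relative clause on the subject,
# one on the object (the sentence itself is invented for illustration).
top = Clause("the departments {0} supported the reforms {1}",
             [Clause("that oversee trade"), Clause("that began last year")])
print(glue(top))
# the departments that oversee trade supported the reforms that began last year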
5.3.3 Chinese Segmentation and Parsing
Section 5.1 evaluates AEP prediction accuracy only over test clauses that were parsed
correctly. However, about a third of the 150 clauses that I inspected were bad parses at the
clausal level. Such parsing inaccuracy severely reduces the real-world feasibility of a
tree-to-tree translation system. That Chinese must first be segmented additionally creates
an early source of error that can easily lead the parser, and all subsequent stages, to failure.
My system therefore directly depends on the further development of core Chinese
segmentation and parsing technology. The good news is that parsing and segmentation are
hot research areas and continue to improve; part of the inaccuracy in parsing Chinese stems
from the fact that the parsing community hasn't yet given it as much attention as English.
Chapter 6
Contributions
Taking Cowan's work as my foundation, I designed, built, and presented a clausal translation
system for Chinese-to-English and explored its effectiveness. I used Aligned Extended
Projections to represent the relationship between Chinese and English clause structure. The
core of the system is a discriminative feature-based model, trained with the perceptron
algorithm, that predicts an English AEP from a Chinese clause.
Training data extraction - processing a parallel corpus of Chinese and English sentences
into a set of training examples - constituted a substantial part of my contribution. I presented
algorithms for Chinese clause splitting, clause alignment, and AEP extraction.
Experimentation on real data revealed both impressive performance of the AEP prediction
engine itself and the severity of three categories of errors that currently hold the system
back from pushing the state of the art: errors in modifier insertion, in clause gluing, and
in Chinese parsing and segmentation. Modifier insertion and clause gluing are substantial
problems in and of themselves; they weren't the focus of this thesis but would be a great
target for another. Chinese parsing and segmentation improve each year, and will continually
make syntactic modeling - including clausal translation - more powerful.
By explicitly modeling syntax at a clausal level, but using a phrase-based system on
clausal modifiers such as subject and object, my system sought to improve upon the state
of the art's current weakness in long-distance word reordering while still maintaining the
excellent content translation that direct transfer models have to offer.
As this thesis centers around (machine) learning, I close with an appropriate quotation
from Chairman Mao:
Chinese: 好好学习，天天向上
Literal: good good study, day day up
English: Study Hard, Make Progress Every Day
Bibliography
[1] Brooke Cowan, Ivona Kučerová, and Michael Collins. A discriminative model for
tree-to-tree translation. In EMNLP 2006.
[2] Philipp Koehn, Franz J. Och, and Daniel Marcu. Statistical phrase-based translation.
In HLT/NAACL 2003.
[3] Robert Frank. Phrase Structure Composition and Syntactic Dependencies. Cambridge,
MA: MIT Press, 2002.
[4] Stuart M. Shieber and Yves Schabes. Synchronous tree-adjoining grammars. In
Proceedings of the 13th International Conference on Computational Linguistics, 1990.
[5] Warren Weaver. Memorandum, July 1949. Reprinted in MT News International, no. 22,
July 1999, pp. 5-6, 15.
[6] John Hutchins. The Georgetown-IBM Demonstration, 7th January 1954. MT News
International, no. 8, May 1994, pp. 15-18.
[7] Franz J. Och and Hermann Ney. A systematic comparison of various statistical
alignment models. Computational Linguistics, 29(1):19-51, 2003.
[8] Peter Brown, Stephen Della Pietra, Vincent Della Pietra, and Robert Mercer. The
Mathematics of Statistical Machine Translation: Parameter Estimation. Computational
Linguistics, 19(2):263-311, 1993.
[9] Michael Collins and Brian Roark. Incremental parsing with the perceptron algorithm.
In ACL 2004.
[10] Michael Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD
thesis, University of Pennsylvania, 1999.
[11] Peter Bartlett, Michael Collins, Ben Taskar, and David McAllester. Exponentiated
gradient algorithms for large-margin structured classification. In Proceedings of NIPS, 2004.
[12] Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory
and Experiments with Perceptron Algorithms. In EMNLP 2002.
[13] Franz Och. Statistical Machine Translation Live. Google Research Blog post, April 28,
2006.
[14] Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer. The Penn Chinese
TreeBank: Phrase Structure Annotation of a Large Corpus. Natural Language Engineering,
11(2):207-238, 2005.
[15] Jonathan Graehl and Kevin Knight. Training Tree Transducers. In NAACL-HLT 2004.
[16] David Chiang. Hierarchical Phrase-Based Translation. Computational Linguistics,
33(2):201-228, 2007.
[17] Kenji Yamada and Kevin Knight. A Syntax-Based Statistical Translation Model. In
ACL 2001.
[18] Fei Xia and Michael McCord. Improving a statistical MT system with automatically
learned rewrite patterns. In COLING 2004.
[19] Daniel Marcu, Wei Wang, Abdessamad Echihabi, and Kevin Knight. SPMT: Statistical
Machine Translation with Syntactified Target Language Phrases. In Proceedings of
EMNLP 2006, pp. 44-52, Sydney, Australia.
[20] Bonnie Dorr, Pamela Jordan, and John Benoit. A Survey of Current Research in
Machine Translation. Advances in Computers, Vol. 49, M. Zelkowitz (Ed.), Academic
Press, London, pp. 1-68, 1999.
[21] Percy Liang, Alexandre Bouchard-Côté, Dan Klein, and Ben Taskar. An End-to-End
Discriminative Approach to Machine Translation. In COLING-ACL 2006.
[22] Philipp Koehn and Hieu Hoang. Factored Translation Models. In EMNLP 2007.
[23] Arthur Dempster, Nan Laird, and Donald Rubin. Maximum Likelihood from Incomplete
Data Via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38,
1977.
[24] Kevin Knight and Philipp Koehn. What's New in Statistical Machine Translation.
Tutorial at HLT/NAACL 2004.
[25] Rebecca Nesson and Stuart Shieber. Extraction Phenomena in Synchronous TAG
Syntax and Semantics. In Dekai Wu and David Chiang, editors, Proceedings of the
Workshop on Syntax and Structure in Statistical Translation, Rochester, New York, April
2007.