Combining Word-Alignment
Symmetrizations in
Dependency Tree Projection
David Mareček
marecek@ufal.mff.cuni.cz
Charles University in Prague
Institute of Formal and Applied Linguistics
CICLING conference
Tokyo, Japan, February 21, 2011
Motivation
Suppose we have a text in a language that is not very common...
We would like to parse it, but we do not have any parser
no manually annotated treebank
But we do have a parallel corpus with another language
English
Our goal – To create a parser
Take the parallel corpus with English
Compute a word alignment on it
GIZA++
Parse the English side of the corpus
MST dependency parser
Transfer the dependencies from English to the target language using
the word-alignment
Train the parser on the resulting trees
Previous work
Rebecca Hwa (2002, 2005)
Simple algorithm for projecting trees from English to Spanish and
Chinese
Only one type of alignment was used, and it is not specified which one
K. Ganchev, J. Gillenwater, B. Taskar (2009)
Unsupervised parser with posterior regularization, in which inferred
dependencies should correspond to projected ones
English to Bulgarian
Our contribution
To show that using several types of alignment improves the
quality of dependency projection
GIZA++ [Och and Ney, 2003]
two unidirectional asymmetric alignments
symmetrization methods
Simple algorithm for projecting dependencies using different types of
alignment links
Training and evaluating MST parser
Word alignment
The GIZA++ toolkit produces asymmetric output
For each word in one language, just one counterpart in the other
language is found
[Figure: ENGLISH-to-X alignment of "Coordination of fiscal policies indeed , can be counterproductive ." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ."]
[Figure: X-to-ENGLISH alignment of "Coordination of fiscal policies indeed , can be counterproductive ." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ."]
Symmetrization methods
Combinations of the previous two unidirectional alignments
[Figure: INTERSECTION alignment of "Coordination of fiscal policies indeed , can be counterproductive ." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ."]
[Figure: GROW-DIAG-FINAL alignment of "Coordination of fiscal policies indeed , can be counterproductive ." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ."]
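A minimal sketch of how two unidirectional alignments can be combined, assuming each alignment is stored as a set of (english_index, x_index) pairs. The function name `symmetrize` and the toy links are hypothetical and are not GIZA++ output. Only intersection and union are shown; GROW-DIAG-FINAL, which starts from the intersection and adds selected links from the union, is not implemented here.

```python
def symmetrize(en_to_x, x_to_en):
    """Return (intersection, union) of two directional alignments.

    Both arguments are sets of (english_index, x_index) pairs.
    """
    intersection = en_to_x & x_to_en   # links found in both directions
    union = en_to_x | x_to_en          # links found in at least one direction
    return intersection, union


# Toy data, invented for illustration only.
# ENGLISH-to-X: every English token has exactly one X counterpart.
en_to_x = {(0, 1), (1, 1), (2, 2)}
# X-to-ENGLISH: every X token has exactly one English counterpart.
x_to_en = {(0, 0), (0, 1), (2, 2)}

intersection, union = symmetrize(en_to_x, x_to_en)
print(sorted(intersection))
print(sorted(union))
```

The intersection is always a subset of each unidirectional alignment, so it contains the fewest links of the four alignment types discussed here.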
Which alignment to use for the projection?
We have presented four different types of alignment
ENGLISH-to-X, X-to-ENGLISH, INTERSECTION, GROW-DIAG-FINAL
We prefer X-to-ENGLISH alignment
we need to find a parent for each token in the language X
we don’t mind English words that are not aligned
We recognize three types of links
A: links that appeared in INTERSECTION alignment (red)
B: links that appeared in GROW-DIAG-FINAL and also in X-to-ENGLISH
alignment (orange)
C: links that appeared only in X-to-ENGLISH alignment (blue)
[Figure: alignment of "Coordination of fiscal policies indeed , can be counterproductive ." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ." with links colored by type (A red, B orange, C blue)]
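The three link types can be sketched as a labeling of the X-to-ENGLISH links. This is an illustrative sketch under stated assumptions: alignments are sets of (english_index, x_index) pairs, the function name `classify_links` is hypothetical, and the toy links are invented.

```python
def classify_links(x_to_en, intersection, grow_diag_final):
    """Label every X-to-ENGLISH link as type 'A', 'B' or 'C'.

    A: link also appears in the INTERSECTION alignment
    B: link appears in GROW-DIAG-FINAL (but not in INTERSECTION)
    C: link appears only in the X-to-ENGLISH alignment
    """
    labels = {}
    for link in x_to_en:
        if link in intersection:
            labels[link] = "A"
        elif link in grow_diag_final:
            labels[link] = "B"
        else:
            labels[link] = "C"
    return labels


# Toy data, invented for illustration only.
x_to_en = {(0, 0), (0, 1), (2, 2), (2, 3)}
intersection = {(0, 1), (2, 2)}
grow_diag_final = {(0, 1), (1, 1), (2, 2), (2, 3)}

labels = classify_links(x_to_en, intersection, grow_diag_final)
for link in sorted(labels):
    print(link, labels[link])
```

Only links present in the X-to-ENGLISH alignment are labeled, so a GROW-DIAG-FINAL link such as (1, 1) that is missing from X-to-ENGLISH gets no label.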
Algorithm - example
[Figure: projection example on "Coordination of fiscal policies indeed , can be counterproductive ." and "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein ."]
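The slides do not spell out the projection algorithm in text, so the following is only a generic direct-projection sketch, not the authors' method. It assumes an X-to-ENGLISH alignment given as a dictionary from X token index to English token index, ignores the link types A, B and C, and attaches unaligned tokens to the root. The function name `project_tree` and the toy tree are hypothetical.

```python
def project_tree(en_parents, x_to_en, x_length):
    """Project an English dependency tree onto an X sentence.

    en_parents[e] is the parent of English token e (tokens are 1-based,
    0 is the artificial root, en_parents[0] is unused).
    x_to_en maps an X token index (1-based) to its aligned English token.
    Returns x_parents, where x_parents[x] is the parent of X token x.
    """
    # For each aligned English token, keep the first X token aligned to it.
    representative = {}
    for x in sorted(x_to_en):
        representative.setdefault(x_to_en[x], x)

    x_parents = [0] * (x_length + 1)
    for x in range(1, x_length + 1):
        e = x_to_en.get(x)
        if e is None:
            continue  # unaligned token stays attached to the root
        # Climb the English tree to the nearest ancestor that has an
        # aligned X token; fall back to the root if there is none.
        ancestor = en_parents[e]
        while ancestor != 0 and ancestor not in representative:
            ancestor = en_parents[ancestor]
        x_parents[x] = representative.get(ancestor, 0)
    return x_parents


# Toy example, invented for illustration only.
# English tokens: 1 policies, 2 can, 3 be, 4 counterproductive
en_parents = [0, 2, 0, 2, 3]
# X tokens: 1 Maßnahmen, 2 kann, 3 kontraproduktiv, 4 sein
x_to_en = {1: 1, 2: 2, 3: 4, 4: 3}

x_parents = project_tree(en_parents, x_to_en, x_length=4)
print(x_parents[1:])
```

Because every X token is attached to a token aligned to a proper ancestor of its own English counterpart, the result cannot contain cycles.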
Results
The best results for each of the testing languages:
English parser trained on CoNLL-X data
The projection was made on the first 100,000 sentence pairs from the News Commentary (or Acquis Communautaire) parallel corpus
We used McDonald’s maximum spanning tree parser
Language    Parallel Corpus   Testing Data   Accuracy
Bulgarian   Acquis            CoNLL-X        52.7 %
Czech       News              CoNLL-X        62.0 %
Dutch       Acquis            CoNLL-X        52.4 %
German      News              CoNLL-X        55.7 %
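The slides report "Accuracy" without defining it. The sketch below assumes it means the proportion of tokens whose predicted parent matches the gold parent (unlabeled attachment). The function name `attachment_accuracy` and the toy parent lists are hypothetical.

```python
def attachment_accuracy(gold_parents, predicted_parents):
    """Fraction of tokens whose predicted parent equals the gold parent."""
    if len(gold_parents) != len(predicted_parents):
        raise ValueError("gold and predicted must have the same length")
    if not gold_parents:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_parents, predicted_parents))
    return correct / len(gold_parents)


# Toy data, invented for illustration: one of four parents is wrong.
score = attachment_accuracy([2, 0, 2, 3], [2, 0, 4, 3])
print(f"{100 * score:.1f} %")
```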
Why is the accuracy so low?
Treebanks in CoNLL differ in annotation guidelines
Different handling of coordination structures, auxiliary verbs, noun
phrases, ...
Comparison with previous work
We have run our projection method on the same datasets as in the
previous work by Ganchev et al. (2009)
Bulgarian, OpenSubtitles parallel corpus
English parser trained on the Penn Treebank
Tested on Bulgarian CoNLL-X training sentences of up to 10 words
Method           Parser                 Accuracy
Ganchev et al.   Discriminative model   66.9 %
Ganchev et al.   Generative model       67.8 %
Our method       MST parser             68.1 %
Our results are slightly better
we did NOT use any unsupervised inference of dependency edges
we made better use of the word alignment
Conclusions
We showed that combining different types of word alignment
improves dependency tree projection
We outperform the state-of-the-art results
The difficulty in testing lies in the different annotation guidelines of each
treebank
Thank you for your attention