Combining Word-Alignment Symmetrizations in Dependency Tree Projection
David Mareček
marecek@ufal.mff.cuni.cz
Charles University in Prague
Institute of Formal and Applied Linguistics
CICLING conference
Tokyo, Japan, February 21, 2011
Motivation

Suppose we have a text in a language which is not very common...
We would like to parse it, but we do not have any parser for it

  no manually annotated treebank

But we do have a parallel corpus with another language

  English

Our goal – To create a parser

Take the parallel corpus with English

Run word alignment on it

  GIZA++

Parse the English side of the corpus

  MST dependency parser

Transfer the dependencies from English to the target language using the word alignment

Train the parser on the resulting trees
Previous work

Rebecca Hwa (2002, 2005)

  Simple algorithm for projecting trees from English to Spanish and Chinese
  Only one type of alignment is used, and it is not specified which one

K. Ganchev, J. Gillenwater, B. Taskar (2009)

  Unsupervised parser with posterior regularization, in which inferred dependencies should correspond to the projected ones
  English to Bulgarian
Our contribution

To show that using various types of alignment improves the quality of dependency projection

GIZA++ [Och and Ney, 2003]

  two unidirectional asymmetric alignments
  symmetrization methods

Simple algorithm for projecting dependencies using different types of alignment links

Training and evaluating the MST parser
Word alignment

The GIZA++ toolkit has asymmetric output

For each word in one language, just one counterpart in the other language is found
[Figure: ENGLISH-to-X alignment of "Coordination of fiscal policies indeed, can be counterproductive." with "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein."]
[Figure: X-to-ENGLISH alignment of the same sentence pair]
Symmetrization methods

Combinations of the two previous unidirectional alignments
[Figure: INTERSECTION alignment of the same sentence pair]
[Figure: GROW-DIAG-FINAL alignment of the same sentence pair]
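As a rough illustration (not part of the slides), the two basic symmetrizations can be sketched in Python over alignments represented as sets of (English index, X index) links; the index values below are made up rather than real GIZA++ output, and GROW-DIAG-FINAL additionally applies growing heuristics that are omitted here.

# Sketch under the assumptions stated above: toy unidirectional alignments
# as sets of (english_index, x_index) links.
e2x = {(0, 1), (2, 2), (3, 3), (6, 4), (8, 7)}   # ENGLISH-to-X (toy values)
x2e = {(0, 1), (3, 3), (4, 4), (8, 7), (9, 9)}   # X-to-ENGLISH (toy values)

# INTERSECTION: links found in both directions (high precision).
intersection = e2x & x2e

# Union of both directions: the upper bound of GROW-DIAG-FINAL (high
# recall); the real heuristic grows from the intersection towards this
# union by adding neighbouring links and finally links for unaligned words.
union = e2x | x2e

print(sorted(intersection))
print(sorted(union))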
Which alignment to use for the projection?

We have presented four different types of alignment

  ENGLISH-to-X, X-to-ENGLISH, INTERSECTION, GROW-DIAG-FINAL

We prefer the X-to-ENGLISH alignment

  we need to find a parent for each token in language X
  we don't mind English words that are not aligned

We recognize three types of links (see the sketch after the figure below)

  A: links that appear in the INTERSECTION alignment (red)
  B: links that appear in GROW-DIAG-FINAL and also in the X-to-ENGLISH alignment (orange)
  C: links that appear only in the X-to-ENGLISH alignment (blue)
[Figure: the example sentence pair with link types A (red), B (orange), and C (blue) marked]
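A minimal sketch of the three-way link classification just described, assuming alignments are represented as sets of (English index, X index) links; the example values are illustrative only, not taken from the slides.

def classify_links(x2e, intersection, grow_diag_final):
    """Label every X-to-ENGLISH link as type A, B, or C (as defined above)."""
    types = {}
    for link in x2e:
        if link in intersection:
            types[link] = "A"          # also in INTERSECTION (most reliable)
        elif link in grow_diag_final:
            types[link] = "B"          # also in GROW-DIAG-FINAL
        else:
            types[link] = "C"          # only in X-to-ENGLISH
    return types

# toy alignments as sets of (english_index, x_index) links
x2e = {(0, 1), (3, 3), (4, 4), (8, 7)}
intersection = {(0, 1), (3, 3)}
grow_diag_final = {(0, 1), (3, 3), (4, 4)}
print(classify_links(x2e, intersection, grow_diag_final))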
Algorithm - example
[Figure: the projection algorithm illustrated step by step on the same sentence pair]
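The slides show the algorithm only as a worked figure; the sketch below is a simplified Hwa-style projection rule (an assumption for illustration, not necessarily the exact algorithm of the talk): each X token takes as its parent the X token aligned to the English parent of its own aligned English token.

def project(english_heads, x2e, n_x_tokens):
    """Simplified projection rule (assumed, see the note above).
    english_heads[i]: parent index of English token i (-1 for the root).
    x2e: maps each X token index to its aligned English token index (or None).
    Returns a parent index for every X token (-1 for root / unattached)."""
    e2x = {e: x for x, e in x2e.items() if e is not None}
    heads = []
    for x in range(n_x_tokens):
        e = x2e.get(x)
        if e is None or english_heads[e] == -1:
            heads.append(-1)                       # unaligned token or root
        else:
            heads.append(e2x.get(english_heads[e], -1))
    return heads

# toy example: 3 English tokens, 3 X tokens, one-to-one alignment
print(project([1, -1, 1], {0: 0, 1: 1, 2: 2}, 3))  # -> [1, -1, 1]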
Results

The best results for each of the test languages:

  English parser trained on CoNLL-X data
  The projection was made on the first 100,000 sentence pairs from the News Commentary (or Acquis Communautaire) parallel corpus
  We used McDonald's maximum spanning tree parser

Language     Parallel Corpus    Testing Data    Accuracy
Bulgarian    Acquis             CoNLL-X         52.7 %
Czech        News               CoNLL-X         62.0 %
Dutch        Acquis             CoNLL-X         52.4 %
German       News               CoNLL-X         55.7 %
Why is the accuracy so low?

The CoNLL-X treebanks differ in annotation guidelines
Different handling of coordination structures, auxiliary verbs, noun phrases, ...
Comparison with previous work

We have run our projection method on the same data sets as in the previous work by Ganchev et al. (2009)

  Bulgarian, OpenSubtitles parallel corpus
  English parser trained on the Penn Treebank
  Tested on Bulgarian CoNLL-X training sentences of up to 10 words

Method            Parser                  Accuracy
Ganchev et al.    Discriminative model    66.9 %
Ganchev et al.    Generative model        67.8 %
Our method        MST parser              68.1 %
Our results are slightly better

  we did NOT use any unsupervised inference of dependency edges
  we made better use of the word alignment
Conclusions

We showed that combining different word-alignment symmetrizations improves dependency tree projection

We outperform the state-of-the-art results

Evaluation is complicated by the different annotation guidelines of each treebank
Thank you for your attention