Combining Word-Alignment Symmetrizations in Dependency Tree Projection

David Mareček (marecek@ufal.mff.cuni.cz)
Charles University in Prague, Institute of Formal and Applied Linguistics
CICLing conference, Tokyo, Japan, February 21, 2011

Motivation

- Suppose we have a text in a language that is not very common.
- We would like to parse it, but we have no parser and no manually annotated treebank.
- We do, however, have a parallel corpus with another language: English.

Our goal: to create a parser

- Take the parallel corpus with English.
- Run word alignment on it (GIZA++).
- Parse the English side of the corpus (MST dependency parser).
- Transfer the dependencies from English to the target language using the word alignment.
- Train the parser on the resulting trees.

Previous work

- Rebecca Hwa (2002, 2005): a simple algorithm for projecting trees from English to Spanish and Chinese; only one type of alignment was used, and it is not specified which one.
- K. Ganchev, J. Gillenwater, B. Taskar (2009): an unsupervised parser with posterior regularization, in which the inferred dependencies should correspond to the projected ones; English to Bulgarian.

Our contribution

- We show that utilizing various types of alignment improves the quality of dependency projection:
  - GIZA++ [Och and Ney, 2003]: two unidirectional (asymmetric) alignments;
  - symmetrization methods.
- A simple algorithm for projecting dependencies using different types of alignment links.
- Training and evaluating the MST parser.

Word alignment

- The output of the GIZA++ toolkit is asymmetric: for each word in one language, just one counterpart in the other language is found.

[Figure: ENGLISH-to-X alignment of "Coordination of fiscal policies, indeed, can be counterproductive." / "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein."]
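The asymmetry can be illustrated with a small sketch, in which each direction maps every token to at most one counterpart on the other side. The index pairs below are made-up toy links for illustration, not real GIZA++ output for the example sentence:

```python
# Toy sketch of GIZA++-style asymmetric alignments (indices are made up
# for illustration; they are not the real links for the example sentence).
# Each direction maps a token to at most one counterpart on the other side.

english_to_x = {0: 1, 2: 2, 3: 3, 4: 7, 6: 4}         # English index -> X index
x_to_english = {1: 0, 2: 2, 3: 3, 4: 6, 7: 5, 8: 8}   # X index -> English index

# The two directions need not agree: symmetrizations such as INTERSECTION
# keep only the links confirmed by both of them.
intersection = {(e, x) for e, x in english_to_x.items()
                if x_to_english.get(x) == e}
print(sorted(intersection))   # -> [(0, 1), (2, 2), (3, 3), (6, 4)]
```

Here the link (4, 7) is dropped because the X-to-ENGLISH direction disagrees about it; heuristics such as GROW-DIAG-FINAL then add some of the unconfirmed links back on top of the intersection.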
[Figure: X-to-ENGLISH alignment of the same sentence pair.]

Symmetrization methods

- Combinations of the two unidirectional alignments.

[Figure: INTERSECTION alignment of "Coordination of fiscal policies, indeed, can be counterproductive." / "Eine Koordination finanzpolitischer Maßnahmen kann in der Tat kontraproduktiv sein."]

[Figure: GROW-DIAG-FINAL alignment of the same sentence pair.]

Which alignment should we use for the projection?

- We have presented four different types of alignment: ENGLISH-to-X, X-to-ENGLISH, INTERSECTION, GROW-DIAG-FINAL.
- We prefer the X-to-ENGLISH alignment:
  - we need to find a parent for each token in language X;
  - we do not mind English words that are left unaligned.
- We distinguish three types of links:
  - A: links that appear in the INTERSECTION alignment (red);
  - B: links that appear in GROW-DIAG-FINAL and also in the X-to-ENGLISH alignment (orange);
  - C: links that appear only in the X-to-ENGLISH alignment (blue).

[Figure: the three link types marked on the example sentence pair.]

Algorithm: example

[Figure: step-by-step projection on the example sentence pair.]

Results

- The best results for each of the test languages.
- The English parser was trained on CoNLL-X data.
- The projection was made on the first 100,000 sentence pairs of the News Commentary (or Acquis Communautaire) parallel corpus.
- We used McDonald's maximum spanning tree (MST) parser.

Language    Parallel corpus    Test data    Accuracy
Bulgarian   Acquis             CoNLL-X      52.7 %
Czech       News               CoNLL-X      62.0 %
Dutch       Acquis             CoNLL-X      52.4 %
German      News               CoNLL-X      55.7 %

Why is the accuracy so low?

- The CoNLL-X treebanks differ in their annotation guidelines: different handling of coordination structures, auxiliary verbs, noun phrases, ...

Comparison with previous work

- We ran our projection method on the same datasets as the previous work by Ganchev et al. (2009):
  - Bulgarian, OpenSubtitles parallel corpus;
  - English parser trained on the Penn Treebank;
  - tested on Bulgarian CoNLL-X sentences of up to 10 words.

Method           Parser                 Accuracy
Ganchev et al.   Discriminative model   66.9 %
Ganchev et al.   Generative model       67.8 %
Our method       MST parser             68.1 %

- Our results are slightly better, although:
  - we did NOT use any unsupervised inference of dependency edges;
  - we made better use of the word alignment.

Conclusions

- We showed that combining different word-alignment symmetrizations improves dependency tree projection.
- We outperform the state-of-the-art results.
- The problem with testing is that each treebank follows different annotation guidelines.

Thank you for your attention
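As an appendix-style sketch, the link classification (A, B, C) and the projection step described earlier could look as follows. This is a hedged illustration, not the exact algorithm from the talk: the function names, the 1-based token indices with 0 as the artificial root, and the preference order A > B > C when several X tokens share one English counterpart are all illustrative assumptions.

```python
# Hedged sketch of dependency projection over classified alignment links.
# Assumptions (not from the talk): tokens are numbered from 1, 0 is the
# artificial root, and when several X tokens share one English counterpart
# the more reliable link type wins.

def classify_links(intersection, grow_diag_final, x_to_english):
    """Split X-to-ENGLISH links (x -> e) into reliability types A/B/C."""
    links = {}
    for x, e in x_to_english.items():
        if (e, x) in intersection:
            link_type = "A"        # confirmed by both directions
        elif (e, x) in grow_diag_final:
            link_type = "B"        # also supported by GROW-DIAG-FINAL
        else:
            link_type = "C"        # only in X-to-ENGLISH
        links[x] = (e, link_type)
    return links

def project_tree(english_heads, links):
    """For each X token, take the head of its English counterpart and map
    that head back to X; fall back to the artificial root (0)."""
    priority = {"A": 0, "B": 1, "C": 2}
    # Invert the links: which X token stands for each English token,
    # preferring the most reliable link type.
    e_to_x = {}
    for x, (e, t) in links.items():
        if e not in e_to_x or priority[t] < priority[e_to_x[e][1]]:
            e_to_x[e] = (x, t)
    heads = {}
    for x, (e, _) in links.items():
        e_head = english_heads.get(e, 0)   # head of the English counterpart
        heads[x] = e_to_x[e_head][0] if e_head in e_to_x else 0
    return heads

# Tiny usage example: English tree 0 -> 1 -> 2, aligned one-to-one to X.
links = classify_links(set(), {(1, 1)}, {1: 1, 2: 2})
print(project_tree({1: 0, 2: 1}, links))   # -> {1: 0, 2: 1}
```

The resulting head map could then be written out in CoNLL-X format and fed to the MST parser for training.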