Human Judgements in Parallel Treebank Alignment

Martin Volk, Torsten Marek, Yvonne Samuelsson
University of Zurich and Stockholm University
volk@cl.uzh.ch
23 August 2008

English Syntax Tree
(figure)

DE – EN Alignment
(figure)

SMULTRON
Stockholm MULtilingual TReebank
- 1000 sentences in 3 languages (DE-EN-SV)
- 500 from Jostein Gaarder’s Sophie’s World (~7 500 tokens, 14 tokens/sentence)
- 500 from Economy texts (~11 000 tokens, 22 tokens/sentence)
  - ABB Quarterly report
  - Rainforest Alliance: Banana Certification Program
  - SEB Annual report
- Released: January 2008
- www.ling.su.se/dali/research/smultron/index.htm

German Annotation

German sentence: flat annotation
(figure)

German sentence: deepened
(figure)

English Annotation

English Syntax Tree
(figure)

English annotation
- Follows the Penn Treebank guidelines
- Slower annotation because of
  - insertion of traces
  - secondary edges
  - deeper trees

Tree Alignment
- Sentence alignment
- Word alignment
  - input for Statistical MT
- Phrase alignment
  - linguistically motivated phrases
  - input for Example-based MT

Alignment Example
(figure)

Tools for Parallel Treebanks
- creating and editing trees
  - from mono-lingual treebanks: PoS-taggers, chunkers, editor, ’tree-enricher’
- aligning phrases
  - use of word alignment tools
  - tree alignment editor: Stockholm TreeAligner
- searching across languages
  - TIGER-Search for parallel treebanks
  - Stockholm TreeAligner

Guidelines for Alignment
1. Align words and phrases that represent the same meaning and could serve as translation units in an MT system.
2. Align as many words and phrases as possible.
3. Distinguish between exact and approximate alignments.
4. 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments.
5. m:n sentence alignments are allowed.
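
The guideline that allows 1:n but not m:n word / phrase alignments lends itself to an automatic consistency check. The following is a minimal sketch, not part of the Stockholm TreeAligner: it assumes that alignments are stored as pairs of node identifiers (the identifiers `de_1`, `en_2` and the function name are hypothetical) and it reads "m:n" as any link whose two endpoints each take part in more than one link.

```python
from collections import defaultdict


def find_mn_violations(links):
    """Return the links that take part in an m:n configuration.

    `links` is an iterable of (source_node, target_node) pairs.
    Assumption: a group of links is a legal 1:n alignment only if it
    forms a star, i.e. one node on one side linked to several nodes on
    the other side. A link is therefore flagged when BOTH of its
    endpoints have more than one link.
    """
    unique_links = list(dict.fromkeys(links))  # drop duplicates, keep order
    source_degree = defaultdict(int)
    target_degree = defaultdict(int)
    for source, target in unique_links:
        source_degree[source] += 1
        target_degree[target] += 1
    return [
        (source, target)
        for source, target in unique_links
        if source_degree[source] > 1 and target_degree[target] > 1
    ]


# Hypothetical node identifiers, for illustration only.
allowed = [("de_1", "en_1"), ("de_1", "en_2"), ("de_2", "en_3")]
not_allowed = [("de_1", "en_1"), ("de_1", "en_2"), ("de_3", "en_2")]

print(find_mn_violations(allowed))
print(find_mn_violations(not_allowed))
```

In the second list, `de_1` is linked to two English nodes and `en_2` to two German nodes, so the link between them is reported. A check of this kind could run per sentence pair, since m:n alignments remain allowed at the sentence level.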

Examples
Do not align:
  die Verwunderung über das Leben
  their astonishment at the world
Do align:
  was für eine seltsame Welt
  what an extraordinary world

Specific rules
- a pronoun in one language shall never be aligned with a full noun in the other
- names are aligned regardless of spelling, unless the name is changed (fiction)
- ignore number/case but not voice

Exact vs approximate alignment
- best vs. “second-best” translation
- an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other
  PT – Power Technologies
- difficult distinctions
  einer der ersten Tage im Mai – early May

Related Research
- Blinker project (Melamed)
- Prague Czech-English Treebank
- Example-based MT in Dublin
- Linköping English-Swedish Treebank

Experiment
- 12 students to align 20 tree pairs DE-EN
  - 10 tree pairs from Sophie’s World
  - 10 tree pairs from Economy text
- advanced CL students
- received
  - a short introduction
  - the written guidelines

Gold Standard Alignment (DE-EN)
                    word - word              phrase - phrase
                    exact  approx.  sum      exact  approx.  sum
10 sent. Sophie        75        3   78         46       12   58
10 sent. Econ         159       19  178         62        9   71

Experiment: Results
- The number of alignments varied widely across students
  - Sophie part: from 47 to 125 (ø = 94.3)
  - Econ part: from 62 to 259 (ø = 186.9)
- the 3 students with the lowest numbers were non-native speakers of German
- 1 student had misunderstood the task

Experiment: Results
- The remaining 8 students had a high overlap with the gold standard (Recall):
  - Sophie part: from 48% to 81% (ø = 68.7%)
  - Econ part: from 66% to 89% (ø = 75.5%)
- Precision
  - Sophie part: from 81% to 97% (ø = 89.1%)
  - Econ part: from 78% to 94% (ø = 88.2%)

Discrepancies
- students sometimes aligned a word (or some words) with a node,
  e.g. the word natürlich to the phrase of course
- students sometimes aligned a German verb group with a single verb form in English,
  e.g. ist zurückzuführen vs. reflecting

Discrepancies
based on different grammatical forms:
- a definite singular NP in German with an indefinite plural NP in English
  der Umsatz vs. revenues
- a German genitive NP with a PP in English
  der beiden Divisionen vs. of the two divisions

Missed by all students
- alignment of a German word to an empty token in English
  wenn sie die Hand ausstreckte vs. herself shaking hands

Conclusions
1. Our alignment guidelines are sufficient for a core of clear alignment decisions.
2. Needed:
   1. Better alignment rules with concrete examples.
   2. Better support tools (consistency checking).
3. The distinction between exact alignment and approximate alignment is very tricky.

Thank You for Your Attention!
Questions?

Applications of Parallel Treebanks
For the Translator
1. corpus for translation studies
   - search tools needed
For the Computational Linguist
2. input for Example-based Machine Translation
3. evaluation corpus for word, phrase or clause alignment
4. training corpus for transfer rules

Alignment Example
(figure)

Parallel Treebanking
(workflow diagram)
- DE sentence -> ANNOTATE: PoS tagger (STTS), Chunker (TIGER) -> flat DE tree -> Deepening -> DE tree
- SV sentence -> PoS tagger (SUC) -> STTS conversion -> ANNOTATE: Chunker (SWE-TIGER) -> flat SV tree -> Deepening + Back conv. -> SV tree
- DE tree + SV tree -> phrase alignment
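
The recall and precision figures on the results slides compare each student's alignments with the gold standard. The slides do not spell out how the comparison was made, so the following is only a sketch under stated assumptions: each alignment is a pair of node identifiers, a student link counts as correct when the same pair occurs in the gold standard, and the exact / approximate label is ignored. The function name and the sample identifiers are hypothetical.

```python
def precision_recall(student_links, gold_links):
    """Compare one annotator's alignment links with the gold standard.

    Both arguments are iterables of (source_node, target_node) pairs.
    Precision: share of the student's links that are in the gold standard.
    Recall: share of the gold-standard links that the student found.
    Assumption: alignment type (exact vs. approximate) is not compared.
    """
    student = set(student_links)
    gold = set(gold_links)
    overlap = student & gold
    precision = len(overlap) / len(student) if student else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return precision, recall


# Toy data with hypothetical node identifiers, not taken from SMULTRON.
gold = [("de_1", "en_1"), ("de_2", "en_2"), ("de_3", "en_3"), ("de_4", "en_4")]
student = [("de_1", "en_1"), ("de_2", "en_2"), ("de_5", "en_3")]

precision, recall = precision_recall(student, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy case the student finds two of the four gold links and adds one link that is not in the gold standard. This gives a higher precision than recall, the same pattern as in the reported averages.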