Human Judgements in Parallel Treebank Alignment
Martin Volk, Torsten Marek, Yvonne Samuelsson
University of Zurich and Stockholm University
volk@cl.uzh.ch
English Syntax Tree
23 August 2008
DE – EN Alignment
SMULTRON
- Stockholm MULtilingual TReebank
- 1000 sentences in 3 languages (DE-EN-SV)
  - 500 from Jostein Gaarder’s Sophie’s World (~ 7 500 tokens, 14 tokens/sentence) and
  - 500 from Economy texts (~ 11 000 tokens, 22 tokens/sentence)
    - ABB Quarterly report
    - Rainforest Alliance: Banana Certification Program
    - SEB Annual report
- Released: January 2008
  www.ling.su.se/dali/research/smultron/index.htm
German Annotation
German sentence: flat annotation
German sentence: deepened
English Annotation
English Syntax Tree
English annotation
- Follows the Penn Treebank guidelines
- Slower annotation because of
  - insertion of traces
  - secondary edges
  - deeper trees
Tree Alignment
- Sentence alignment
- Word alignment
  - input for Statistical MT
- Phrase alignment
  - linguistically motivated phrases
  - input for Example-based MT
Alignment Example
Tools for Parallel Treebanks
- creating and editing trees
  - from mono-lingual treebanks
  - PoS-taggers, chunkers, editor, ’tree-enricher’
- aligning phrases
  - use of word alignment tools
  - tree alignment editor → Stockholm TreeAligner
- searching across languages
  - TIGER-Search for parallel treebanks → Stockholm TreeAligner
Guidelines for Alignment
1. Align words and phrases that represent the same meaning and could serve as translation units in an MT system.
2. Align as many words and phrases as possible.
3. Distinguish between exact and approximate alignments.
4. 1:n word / phrase alignments are allowed, but not m:n word / phrase alignments.
5. m:n sentence alignments are allowed.
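The rules on exact vs. approximate links and on 1:n vs. m:n alignments can be made concrete with a small data model. The Python sketch below is illustrative only: the `Alignment` class, the node identifiers and the `find_m_to_n` helper are hypothetical names and not part of SMULTRON or the Stockholm TreeAligner. It assumes that each link connects one node on the German side to one node on the English side, so that an m:n configuration shows up as a connected group of links with more than one node on both sides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One link between a source-side and a target-side node (hypothetical model)."""
    src: str             # node id in the German tree, e.g. "de:4"
    tgt: str             # node id in the English tree, e.g. "en:5"
    kind: str = "exact"  # "exact" or "approximate"


def find_m_to_n(links):
    """Return groups of links that together form an m:n alignment (m > 1 and n > 1)."""
    links = list(links)
    by_src, by_tgt = defaultdict(set), defaultdict(set)
    for link in links:
        by_src[link.src].add(link)
        by_tgt[link.tgt].add(link)

    seen, violations = set(), []
    for start in links:
        if start in seen:
            continue
        # Collect every link reachable from `start` through shared nodes.
        component, stack = set(), [start]
        while stack:
            link = stack.pop()
            if link in component:
                continue
            component.add(link)
            stack.extend(by_src[link.src] | by_tgt[link.tgt])
        seen |= component
        sources = {link.src for link in component}
        targets = {link.tgt for link in component}
        if len(sources) > 1 and len(targets) > 1:
            violations.append(sorted(component, key=lambda x: (x.src, x.tgt)))
    return violations


links = [
    Alignment("de:1", "en:1"),                 # 1:1
    Alignment("de:2", "en:2"),
    Alignment("de:2", "en:3", "approximate"),  # 1:n, allowed
    Alignment("de:4", "en:5"),
    Alignment("de:4", "en:6"),
    Alignment("de:5", "en:6"),                 # together with the two above: m:n
]
for group in find_m_to_n(links):
    print([(link.src, link.tgt) for link in group])
```

With this toy input only the three links among de:4, de:5, en:5 and en:6 are reported; the 1:1 and 1:n links pass the check.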
Examples
- Do not align:
  - die Verwunderung über das Leben
  - their astonishment at the world
- Do align:
  - was für eine seltsame Welt
  - what an extraordinary world
Specific rules
- a pronoun in one language shall never be aligned with a full noun in the other
- names are aligned regardless of spelling, unless the name is changed (fiction)
- ignore number/case but not voice
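The pronoun rule lends itself to an automatic check on word-word links. The following Python sketch assumes that the German tokens carry STTS part-of-speech tags and the English tokens Penn Treebank tags; which tags count as "pronoun" and "full noun" is an assumption made here for illustration, and the function name is hypothetical.

```python
# Assumed tag sets (not specified in the guidelines themselves).
DE_PRONOUNS = {"PPER", "PRF", "PDS", "PIS", "PRELS", "PWS"}  # STTS
DE_NOUNS = {"NN", "NE"}                                      # STTS
EN_PRONOUNS = {"PRP", "WP"}                                  # Penn Treebank
EN_NOUNS = {"NN", "NNS", "NNP", "NNPS"}                      # Penn Treebank


def violates_pronoun_rule(de_tag, en_tag):
    """True if a word-word link pairs a pronoun on one side with a full noun on the other."""
    de_pron, de_noun = de_tag in DE_PRONOUNS, de_tag in DE_NOUNS
    en_pron, en_noun = en_tag in EN_PRONOUNS, en_tag in EN_NOUNS
    return (de_pron and en_noun) or (de_noun and en_pron)


print(violates_pronoun_rule("PPER", "NNP"))  # pronoun vs. proper noun
print(violates_pronoun_rule("PPER", "PRP"))  # pronoun vs. pronoun
```

The first call flags the link, the second does not. A check of this kind only covers links between single words; phrase-level links would need the category of the aligned nodes instead.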
Exact vs approximate alignment
- best vs. “second-best” translation
- an acronym in one language shall be aligned as approximate (fuzzy) with a spelled-out term in the other
  - PT – Power Technologies
- difficult distinctions
  - einer der ersten Tage im Mai – early May
Related Research
- Blinker project (Melamed)
- Prague Czech-English Treebank
- Example-based MT in Dublin
- Linköping English-Swedish Treebank
Experiment
- 12 students to align 20 tree pairs DE-EN
  - 10 tree pairs from Sophie’s World
  - 10 tree pairs from Economy text
- advanced CL students
  - received
    - short introduction
    - the written guidelines
Gold Standard Alignment (DE-EN)
                  word - word            phrase - phrase
                  exact  approx.   sum   exact  approx.   sum
10 sent. Sophie      75        3    78      46       12    58
10 sent. Econ       159       19   178      62        9    71
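The gold-standard figures can be cross-checked with a few lines of arithmetic. The Python sketch below assumes the reading in which 78, 58, 178 and 71 are the sums of the exact and approximate counts for each alignment type; the dictionary layout and the `totals` helper are chosen for illustration only.

```python
# Gold-standard counts as (exact, approximate), copied from the figures above.
gold = {
    "Sophie": {"word-word": (75, 3), "phrase-phrase": (46, 12)},
    "Econ": {"word-word": (159, 19), "phrase-phrase": (62, 9)},
}


def totals(part):
    """Return the per-type sums and the overall number of links for one part."""
    per_type = {kind: exact + approx for kind, (exact, approx) in part.items()}
    return per_type, sum(per_type.values())


for name, part in gold.items():
    per_type, overall = totals(part)
    print(name, per_type, "overall:", overall)
```

Under this reading the gold standard contains 136 links for the Sophie part and 249 for the Econ part.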
Experiment: Results
- The students created a huge variety in number of alignments
  - Sophie part: from 47 to 125 (ø = 94.3)
  - Econ part: from 62 to 259 (ø = 186.9)
- the 3 students with the lowest numbers were non-native speakers of German
- 1 student had misunderstood the task
Experiment: Results
- The remaining 8 students had a high overlap with the gold standard (Recall):
  - Sophie part: from 48% to 81% (ø = 68.7%)
  - Econ part: from 66% to 89% (ø = 75.5%)
- Precision
  - Sophie part: from 81% to 97% (ø = 89.1%)
  - Econ part: from 78% to 94% (ø = 88.2%)
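Recall and precision can be read here in the usual set-based way: recall is the share of gold-standard links that a student also produced, and precision is the share of a student's links that occur in the gold standard. The Python sketch below works under that assumption. The function name and the node identifiers are invented for illustration, and the toy data are not taken from the experiment. The slides do not state whether the exact/approximate label had to match for a link to count, so the sketch compares node pairs only.

```python
def precision_recall(student, gold):
    """Compare two collections of alignment links given as (source_node, target_node) pairs."""
    student, gold = set(student), set(gold)
    overlap = student & gold
    precision = len(overlap) / len(student) if student else 0.0
    recall = len(overlap) / len(gold) if gold else 0.0
    return precision, recall


# Toy data, not taken from the experiment.
gold = {("de:1", "en:1"), ("de:2", "en:3"), ("de:500", "en:501"), ("de:501", "en:503")}
student = {("de:1", "en:1"), ("de:2", "en:3"), ("de:500", "en:502")}

p, r = precision_recall(student, gold)
print(f"precision = {p:.1%}, recall = {r:.1%}")
```

In the toy case two of the student's three links are correct and two of the four gold links are found. Precision higher than recall, the pattern in the reported figures, arises when a student creates fewer links than the gold standard but most of them are correct.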
Discrepancies
- students sometimes aligned a word (or some words) with a node.
  - e.g. the word natürlich to the phrase of course
- students sometimes aligned a German verb group with a single verb form in English
  - e.g. ist zurückzuführen vs. reflecting
Discrepancies
- based on different grammatical forms:
  - a definite singular NP in German with an indefinite plural NP in English
    - der Umsatz vs. revenues
  - a German genitive NP with a PP in English
    - der beiden Divisionen vs. of the two divisions
Missed by all students
- alignment of German word to empty token in English
  - wenn sie die Hand ausstreckte vs.
  - herself shaking hands
Conclusions
1. Our alignment guidelines are sufficient for a core of clear alignment decisions.
2. Needed:
   1. Better alignment rules with concrete examples.
   2. Better support tools (consistency checking).
3. The distinction between exact alignment and approximate alignment is very tricky.
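The slides do not say what consistency checking should cover. One plausible check, sketched here in Python purely as an assumption, is to flag pairs of aligned strings that were labelled exact in one place and approximate in another, since that distinction is the one reported as tricky. The function name and the triple format are hypothetical.

```python
from collections import defaultdict


def inconsistent_pairs(links):
    """Find (source_text, target_text) pairs labelled both exact and approximate.

    `links` is an iterable of (source_text, target_text, kind) triples,
    where kind is "exact" or "approximate".
    """
    kinds = defaultdict(set)
    for src, tgt, kind in links:
        kinds[(src, tgt)].add(kind)
    return sorted(pair for pair, seen in kinds.items() if len(seen) > 1)


links = [
    ("PT", "Power Technologies", "approximate"),
    ("PT", "Power Technologies", "exact"),  # conflicts with the link above
    ("Welt", "world", "exact"),
]
print(inconsistent_pairs(links))
```

Here only the acronym pair is reported. A real tool would probably compare at the level of tree nodes rather than surface strings; strings are used here to keep the sketch short.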
Thank You for Your Attention!

Questions???
Applications of Parallel Treebanks
For the Translator
1. corpus for translation studies
   - search tools needed
For the Computational Linguist
2. input for Example-based Machine Translation
3. evaluation corpus for word, phrase or clause alignment
4. training corpus for transfer rules
Alignment Example
Parallel Treebanking
[Workflow diagram]
DE sentence → ANNOTATE (PoS tagger (STTS), Chunker (TIGER)) → flat DE tree → Deepening → DE tree
SV sentence → PoS tagger (SUC) → STTS conversion → ANNOTATE (Chunker (SWE-TIGER)) → flat SV tree → Deepening + Back conv. → SV tree
DE tree + SV tree → phrase alignment