Catch the Link!
Combining Clues for Word Alignment
Jörg Tiedemann
Uppsala University
joerg@stp.ling.uu.se
Outline
Background
What do we want?
What do we have?
What do we need?
Clue Alignment
What is a clue?
How do we find clues?
How do we use clues?
What do we get?
What do we want?
(Diagram: a parallel corpus — a source text and its translations — is sentence-aligned into an aligned corpus; a language-independent word aligner then automatically produces token links and type links.)
What do we have?
tokeniser (ca 99%)
POS tagger (ca 96%)
lemmatiser (ca 99%)
shallow parser (ca 92%), parser (> 80%)
sentence aligner (ca 96%)
word aligner
75% precision
45% recall
What’s the problem with Word Alignment?

Word alignment challenges:

non-linear mapping:
(1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics, ..."
(2) Alsop förklarar: "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik, ..."
(Saul Bellow, “To Jerusalem and back: a personal account”)

grammatical/lexical differences:
(1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.
(2) Our policy of neutrality is underpinned by a strong defence.
(The Declarations of the Swedish Government, 1988)

translation extensions:
(1) Armén kommer att reformeras och effektiviseras.
(2) The army will be reorganized with the aim of making it more effective.
(The Declarations of the Swedish Government, 1988)

translation gaps:
(1) I take the middle seat, which I dislike, but I am not really put out.
(2) Jag tar mittplatsen, vilket jag inte tycker om, men det gör mig inte så mycket.
(Saul Bellow, “To Jerusalem and back: a personal account”)

idiomatic expressions, multi-word equivalences:
(1) Our Hasid is in his late twenties.
(2) Vår chassid är bortåt de trettio.
(Saul Bellow, “To Jerusalem and back: a personal account”)
So what?
What are the real problems?
Word alignment
uses simple, fixed tokenisation
fails to identify appropriate translation units
ignores contextual dependencies
ignores relevant linguistic information
uses poor morphological analyses
What do we need?
flexible tokenisation
possible multi-word units
linguistic tools for several languages
integration of linguistic knowledge
combination of knowledge resources
alignment in context
Let’s go!
Clue Alignment!
• finding clues
• combining clues
• aligning words
Word Alignment Clues
(Example with chunk and POS annotation:)
[NP The/DT United/NNP Nations/NNP conference/NN] [VP has/VBZ started/VBN] [ADVP today/RB] .
[ADVP Idag/RG0S] [VC började/V@IIAS] [NP FN-konferensen/NCUSN@DS] .
Word Alignment Clues
Def.: A word alignment clue Ci(s,t) is a
probability which indicates an association
between two lexical items, s and t, from
parallel texts.
Def.: A lexical item is a set of words with
associated features attached to it.
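A minimal sketch of these two definitions as data structures (the names and fields are illustrative, not taken from the actual implementation):

from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class LexicalItem:
    # a lexical item: a set of words with associated features
    words: Tuple[str, ...]                                   # e.g. ("United", "Nations")
    features: Dict[str, str] = field(default_factory=dict)   # e.g. {"pos": "NNP NNP"}

@dataclass
class Clue:
    # a word alignment clue: a probability-like association
    # between a source and a target lexical item
    source: LexicalItem
    target: LexicalItem
    score: float                                             # value in [0, 1]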
How do we find clues? (1)
Clues can be estimated from association
scores:
Ci(s,t) = wi * Ai (s,t)
co-occurrence:
• Dice coefficient:
A1 (s,t) = Dice (s,t)
• Mutual information: A2 (s,t) = I (s;t)
string similarity
• longest common sub-sequence ratio: A3 (s,t) = LCSR (s,t)
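As a sketch, the two count-based scores might be computed like this (assuming raw frequencies are available; mutual information is analogous):

def dice(cooc, freq_s, freq_t):
    # Dice coefficient: 2 * f(s,t) / (f(s) + f(t))
    return 2.0 * cooc / (freq_s + freq_t) if freq_s + freq_t else 0.0

def lcs_len(a, b):
    # longest common subsequence length (standard dynamic programming)
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def lcsr(s, t):
    # longest common subsequence ratio: |LCS(s,t)| / max(|s|, |t|)
    return lcs_len(s, t) / max(len(s), len(t)) if s and t else 0.0

String similarity rewards cognates and names; e.g. lcsr("conference", "konferensen") scores high.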
How do we find clues? (2)
Clues can be estimated from training
data:
Ci(s,t) = wi * P (ft | fs) ≈ wi * freq(ft , fs) / freq(fs)
fs , ft are features of s and t, e.g.
• part-of-speech sequences of s, t
• phrase category (NP, VP etc), syntactic function
• word position
• context features
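A possible implementation of this estimate from linked feature pairs (the example data below is made up for illustration):

from collections import Counter

def learn_feature_clue(linked_pairs):
    # estimate P(ft | fs) ~ freq(ft, fs) / freq(fs) from (fs, ft) pairs
    # extracted from previously aligned tokens
    pair_freq = Counter(linked_pairs)
    src_freq = Counter(fs for fs, _ in linked_pairs)
    return {(fs, ft): n / src_freq[fs] for (fs, ft), n in pair_freq.items()}

pairs = [("VBZ", "V@IPAS"), ("VBZ", "V@IPAS"), ("VBZ", "V@IIAS"), ("RB", "RG0S")]
pos_clue = learn_feature_clue(pairs)
print(pos_clue[("VBZ", "V@IPAS")])   # 2/3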
How do we use clues? (1)
Clues are simply sets of association measures
The crucial point: we have to combine them!
If Ci(s,t) = P(ai), define the total clue as
Call(s,t) = P(A) = P(a1 ∨ a2 ∨ ... ∨ an)
Clues are not mutually exclusive!
P(a1 ∨ a2) = P(a1) + P(a2) - P(a1 ∧ a2)
Assume independence!
P(a1 ∧ a2) = P(a1) * P(a2)
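Under the independence assumption, the disjunction of n clues reduces to Call(s,t) = 1 - Π(1 - Ci(s,t)); a minimal sketch:

def combine(scores):
    # P(a1 v a2) = P(a1) + P(a2) - P(a1) * P(a2), generalised to
    # 1 - prod(1 - P(ai)) over all clues supporting the pair
    result = 1.0
    for p in scores:
        result *= 1.0 - p
    return 1.0 - result

print(combine([0.4, 0.3]))   # 0.4 + 0.3 - 0.4*0.3 = 0.58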
How do we use clues? (2)
Clues can refer to any set of tokens from source and target language segments:
• overlaps
• inclusions
Def.: A clue shares its indication with all member tokens!
• this allows clue combinations at the level of single tokens
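One way to realise this, as a sketch: distribute each (possibly multi-token) clue to all of its member token pairs and combine per cell with combine() from above — this yields the clue matrix used on the following slides.

def clue_matrix(clues, n_src, n_trg):
    # clues: (src_token_indices, trg_token_indices, score) triples
    cells = [[[] for _ in range(n_trg)] for _ in range(n_src)]
    for src_span, trg_span, score in clues:
        for i in src_span:            # a clue shares its indication
            for j in trg_span:        # with all of its member tokens
                cells[i][j].append(score)
    return [[combine(c) for c in row] for row in cells]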
Clue overlaps - an example
The United Nations conference has started today.
Idag började FN-konferensen.
Clue 1 (co-occurrence):
United Nations -> FN-konferensen 0.4
Nations conference -> FN-konferensen 0.5

Clue 2 (string similarity):
United -> FN-konferensen 0.3
Nations -> FN-konferensen 0.29
conference -> FN-konferensen 0.57

Clue_all (combined, per single token):
United -> FN-konferensen 0.58
Nations -> FN-konferensen 0.787
conference -> FN-konferensen 0.785
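These combined values follow directly from the independence combination above:

print(combine([0.4, 0.3]))         # United     -> FN-konferensen: 0.58
print(combine([0.4, 0.5, 0.29]))   # Nations    -> FN-konferensen: 0.787
print(combine([0.5, 0.57]))        # conference -> FN-konferensen: 0.785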
The Clue Matrix
(Matrix for: "The United Nations conference has started today." / "Idag började FN-konferensen.")

Clue 1 (co-occurrence):
The United Nations -> FN-konferensen 0.5
United Nations -> FN-konferensen 0.4
Nations conference -> började 0.4
has -> började 0.2
started -> började 0.6
started today -> idag 0.3

Clue 2 (string similarity):
conference -> FN-konferensen 0.57
Nations -> FN-konferensen 0.29
today -> idag 0.4

(Combining all clues fills one cell per token pair; the strongest cells include Nations -> FN-konferensen 0.787, today -> idag 0.72, started -> började 0.7, United -> FN-konferensen 0.58, conference -> FN-konferensen 0.57.)
Clue Alignment (1)
general principles:
combine all clues and fill the matrix
highest score = best link
allow overlapping links only
• if there is no better link for both tokens
• if tokens are next to each other
links which overlap at one point form a link
cluster
Clue Alignment (2)
the alignment procedure:
1. find the best link
2. remove the best link (set its value to 0)
3. check for overlaps
• accept: add to set of link clusters
• dismiss otherwise
4. continue with 1 until no more links are found
(or all values are below a certain threshold)
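A compact sketch of this procedure over the clue matrix (the exact overlap and adjacency checks of the original implementation may differ in detail):

def clue_align(matrix, threshold=0.4):
    # matrix: list of rows of combined clue values, e.g. from clue_matrix()
    m = [row[:] for row in matrix]
    clusters = []                         # sets of (src, trg) index pairs
    while True:
        s, t = max(((i, j) for i, row in enumerate(m) for j in range(len(row))),
                   key=lambda ij: m[ij[0]][ij[1]])
        if m[s][t] < threshold:           # 4. stop below the threshold
            break
        m[s][t] = 0.0                     # 2. remove the best link
        touching = [c for c in clusters
                    if any(cs == s or ct == t for cs, ct in c)]   # 3. overlaps?
        if not touching:
            clusters.append({(s, t)})     # independent link: accept
        elif all(any(abs(cs - s) <= 1 and abs(ct - t) <= 1 for cs, ct in c)
                 for c in touching):      # overlap between adjacent tokens
            merged = set().union(*touching) | {(s, t)}
            clusters = [c for c in clusters if c not in touching] + [merged]
        # otherwise the link is dismissed
    return clusters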
Clue Alignment (3)
"The United Nations conference has started today." / "Idag började FN-konferensen."

(Worked example: the best link is taken from the matrix, set to 0, and checked for overlaps, step by step. Accepted links in order: Nations -> FN-konferensen 0.787, today -> idag 0.72, started -> började 0.7, United -> FN-konferensen 0.58, conference -> FN-konferensen 0.57, The -> FN-konferensen 0.5, has -> började 0.2. Overlapping links between adjacent tokens merge, yielding the final link clusters:
The United Nations conference -> FN-konferensen
has started -> började
today -> idag)
Bootstrapping
again: clues can be estimated from
training data
self-training: use available links as training
data
goal: learn new clues for the next step
risk: increased noise (lower precision)
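As a sketch, the loop might look like this (align_corpus, pos_pairs and position_pairs are hypothetical helpers standing in for the clue aligner and the feature extractors):

def bootstrap(corpus, basic_clues, rounds=2):
    clues = list(basic_clues)
    links = align_corpus(corpus, clues)               # hypothetical helper
    for _ in range(rounds):
        clues.append(learn_feature_clue(pos_pairs(links)))        # POS clue
        clues.append(learn_feature_clue(position_pairs(links)))   # position clue
        links = align_corpus(corpus, clues)           # re-align with new clues
    return links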
Learning Clues
POS-clue:
assumption: word pairs with certain POS-tags are
more likely to be translations of each other than
other word pairs
features: POS-tag sequences
position clue:
assumption: translations are relatively close to each other (esp. in related languages)
features: relative word positions (see the sketch below)
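A simplified sketch of estimating a position clue from existing token links (the linear offset bucketing is an assumption, not the original formula):

from collections import Counter

def learn_position_clue(links, src_len, trg_len):
    # links: (sentence_id, src_index, trg_index) triples from earlier runs;
    # src_len/trg_len: sentence lengths per sentence id
    offsets = Counter()
    for sent, s_idx, t_idx in links:
        # expected target position if word order matched linearly
        expected = s_idx * trg_len[sent] / src_len[sent]
        offsets[round(t_idx - expected)] += 1
    total = sum(offsets.values())
    return {off: n / total for off, n in offsets.items()}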
So much for the theory!
Results?!
The setup: Corpus and basic tools:
• Saul Bellow’s “To Jerusalem and back: a personal account”, English/Swedish, about 170,000 words
• English POS-tagger (Grok), trained on Brown, PTB
• English shallow parser (Grok), trained on PTB
• English stemmer, suffix truncation
• Swedish POS-tagger (TnT), trained on SUC
• Swedish CFG parser (Megyesi), rule-based
• Swedish lemmatiser, database taken from SUC
Results!?! … not yet
basic clues:
• Dice coefficient (≥ 0.3)
• LCSR (≥ 0.4), ≥ 3 characters/string
learned clues:
• POS clue
• position clue
clue alignment threshold = 0.4
uniform normalisation (0.5)
Results!!! Come on!
Preliminary results (… work in progress …)
• Evaluation: 500 random samples have been linked manually (gold standard)
• Metrics: precisionPWA & recallPWA (Ahrenberg et al., 2000)

alignment & clues          precision   recall    F
Dice+LCSR (best-first)     79.377%     32.454%   46.071%
Dice+LCSR                  71.225%     41.065%   52.095%
Dice+LCSR+POS              70.667%     48.566%   57.568%
Dice+LCSR+POS+position     72.820%     51.561%   60.374%
Give me more numbers!
The impact of parsing.
How much do we gain?
Alignment results with n-grams, (shallow)
parsing, and both:
                  precision   recall    F
ngrams            74.712%     51.501%   60.972%
chunks            78.410%     52.909%   63.183%
ngrams+chunks     72.820%     51.561%   60.374%
One more thing.
Stemming, lemmatisation and all that …
Do we need morphological analyses for
Swedish and English?
word/lemma/stem                    precision   recall    F
words                              79.490%     48.827%   60.495%
swedish & english stems            77.401%     45.338%   57.181%
swedish lemmas + english stems     78.410%     52.909%   63.183%
Conclusions
Combining clues helps to find links
Linguistic knowledge helps
POS tags are valuable clues
word position gives hints for related languages
parsing helps with the segmentation problem
lemmatisation gives higher recall
We need more experiments, tests with other
language pairs, more/other clues
recall & precision are still low
POS clues - examples
score    source         target
0.915    VBZ            V@IPAS
0.913    WRB            RH0S
0.762    VBP            V@IPAS
0.702    RB             RG0S
0.674    VBD            V@IIAS
0.667    DT NNP NN      NCUSN@DS
0.647    PRP VBZ        PF@USS@S V@IPAS
0.625    NNS NNP        NP00N@0S
0.612    VB             V@N0AS
0.600    RBR            RGCS
0.500    DT JJ JJ NN    DF@US@S AQP0SNDS NCUSN@DS
Position clues - examples
score    mapping
0.245    x -> 0
0.125    x -> -1
0.090    x -> 1
0.077    x -> -2
0.056    x -> -3
0.051    x -> 2
0.040    x -> 6 7 8
Open Questions
Normalisation!
How do we estimate the wi’s?
Non-contiguous phrases
Why not allow long distance clusters?
Independence assumption
What is the impact of dependencies?
Alignment clues
What is a bad clue, what is a good one?
Contextual clues
Clue alignment - example
(Figure: combined clue matrix for the tokenised sentence pair "amused , my wife asks why i ordered the kosher lunch ." and its Swedish translation. Most cells are zero; the non-zero values (scaled by 100) mark link candidates, e.g. my -> min (81) and the tokens of "kosher lunch" against ko-scher-lunch (86, 81).)
Alignment - examples
the Middle East -> Mellersta Östern
afford -> kosta på
at least -> åtminstone
an American satellite -> en satellit
common sense -> sunda förnuftet
Jerusalem area -> Jerusalemområdet
kosher lunch -> koscherlunch
leftist anti-Semitism -> vänsterantisemitism
left-wing intellectuals -> vänsterintellektuella
literary history -> litteraturhistoriska
manuscript collection -> handskriftsamling
Marine orchestra -> marinkårsorkester
marionette theater -> marionetteatern
mathematical colleagues -> matematikkolleger
mental character -> mentalitet
far too -> alldeles
Alignment - examples
a banquet -> en bankett
a battlefield -> ett slagfält
a day -> dagen
the Arab states -> arabstaterna
the Arab world -> arabvärlden
the baggage carousel -> bagagekarusellen
the Communist dictatorships -> kommunistdiktaturerna
The Fatah terrorists -> Al Fatah-terroristerna
the defense minister -> försvarsministern
the defense minister -> försvarsminister
the daughter -> dotter
the first President -> förste president
Alignment - examples
American imperial interests -> amerikanska imperialistintressenas
Chicago schools -> Chicagos skolor
decidedly anti-Semitic -> avgjort antisemitiska
his identity -> sin identitet
his interest -> sitt intresse
his interviewer -> hans intervjuare
militant Islam -> militanta muhammedanismen
no longer -> inte längre
sophisticated arms -> avancerade vapen
still clearly -> uppenbarligen ännu
dozen Russian -> dussin ryska
exceedingly intelligent -> utomordentligt intelligent
few drinks -> några drinkar
goyish democracy -> gojernas demokrati
industrialized countries -> industrialiserade länderna
has become -> har blivit
Gold standard - MWUs
link:      Secretary of State -> Utrikesminister
link type: regular
unit type: multi -> single
source text: Secretary of State Henry Kissinger has won the
Middle Eastern struggle by drawing Egypt into the American
camp.
target text: Utrikesminister Henry Kissinger har vunnit
slaget om Mellanöstern genom att dra in Egypten i det
amerikanska lägret.
Gold standard - fuzzy links
link:      unrelated -> inte tillhör hans släkt
link type: fuzzy
unit type: single -> multi
source text: And though he is not permitted to sit beside
women unrelated to him or to look at them or to communicate
with them in any manner (all of which probably saves him a
great deal of trouble), he seems a good-hearted young man
and he is visibly enjoying himself.
target text: Och fastän han inte får sitta bredvid kvinnor
som inte tillhör hans släkt eller se på dem eller meddela
sig med dem på något sätt (alltsammans saker som utan tvivel
besparar honom en mängd bekymmer) verkar han vara en
godhjärtad ung man, och han ser ut att trivas gott.
Gold standard - null links
link:      do -> (null)
link type: null
unit type: single -> null
source text: "How is it that you do not know English?"
target text: "Hur kommer det sig att ni inte talar engelska?"
Gold standard - morphology
link:      the masses -> massorna
link type: regular
unit type: multi -> single
source text: Arafat was unable to complete the classic
guerrilla pattern and bring the masses into the struggle.
target text: Arafat har inte kunnat fullborda det klassiska
gerillamönstret och föra in massorna i kampen.
Evaluation metrics
Q = (Csrc + Ctrg) / (max(Ssrc, Gsrc) + max(Strg, Gtrg))

precisionPWA = ΣQ / (n(I) + n(P) + n(C))
recallPWA = ΣQ / (n(I) + n(P) + n(C) + n(M))

where n(I), n(P), n(C) and n(M) are the numbers of incorrect, partially correct, correct and missing link proposals, and:

• Csrc – number of overlapping source tokens in (partially) correct link proposals, Csrc = 0 for incorrect link proposals
• Ctrg – number of overlapping target tokens in (partially) correct link proposals, Ctrg = 0 for incorrect link proposals
• Ssrc – number of source tokens proposed by the system
• Strg – number of target tokens proposed by the system
• Gsrc – number of source tokens in the gold standard
• Gtrg – number of target tokens in the gold standard
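A sketch of these metrics, assuming each gold-standard link has been grouped with the (token sets of the) link proposals that touch it:

def pwa_scores(items):
    # items: (gold_src, gold_trg, proposals) per gold link, where
    # proposals is a list of (prop_src, prop_trg) token sets
    q_total = n_ipc = n_missed = 0
    for g_src, g_trg, proposals in items:
        if not proposals:
            n_missed += 1                 # counts towards n(M) only
            continue
        n_ipc += 1                        # counts towards n(I) + n(P) + n(C)
        all_src = set().union(*(p for p, _ in proposals))
        all_trg = set().union(*(t for _, t in proposals))
        denom = max(len(all_src), len(g_src)) + max(len(all_trg), len(g_trg))
        for p_src, p_trg in proposals:
            c_src, c_trg = len(p_src & g_src), len(p_trg & g_trg)
            if c_src and c_trg:           # C = 0 for incorrect proposals
                q_total += (c_src + c_trg) / denom
    return q_total / n_ipc, q_total / (n_ipc + n_missed)

On this reading, the worked examples on the next slide come out at roughly 0.663 precision and 0.569 recall.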
Evaluation metrics example
reference: Reläventil TC -> TC relay valve
proposed:  Reläventil -> Relay valve; TC -> TC
precisionPWA: (3/5 = 0.6) + (2/5 = 0.4) = 1; recallPWA: 1

reference: ordinarie -> ordinary
proposed:  ordinarie skruv -> ordinary
precisionPWA: 2/3 ≈ 0.66; recallPWA: 2/3 ≈ 0.66

reference: kommer att indikeras -> will be indicated
proposed:  det kommer -> will; att -> the; indikeras -> indicated
precisionPWA: (2/7 ≈ 0.286) + (0/7 = 0) + (2/7 ≈ 0.286) ≈ 0.57; recallPWA: ≈ 0.57

reference: vill -> wants
proposed:  vatten -> till
precisionPWA: 0; recallPWA: 0

reference: to -> att
proposed:  to -> att
precisionPWA: 1; recallPWA: 1

reference: to -> (null)
proposed:  (no link proposed)
precisionPWA: –; recallPWA: 0

reference: Scanias chassier -> Scania chassis
proposed:  Scanias -> Scania chassis
precisionPWA: 3/4 = 0.75; recallPWA: 3/4 = 0.75

overall: precisionPWA = ΣQ/6 ≈ 0.663; recallPWA = ΣQ/7 ≈ 0.569
Corpus markup (Swedish)
<s lang="sv" id="9">
<c id="c-1" type="NP">
<w span="0:3" pos="PF@NS0@S" id="w9-1" stem="det">Det</w>
</c>
<c id="c-2" type="VC">
<w span="4:2" pos="V@IPAS" id="w9-2" stem="vara">är</w>
</c>
<c id="c-3">
<w span="7:3" pos="CCS" id="w9-3" stem=”som">som</w>
</c>
<c id="c-4" type="NPMAX">
<c id="c-5" type="NP">
<w span="11:3" pos="DI@NS@S" id="w9-4" stem="en">ett</w>
<w span="15:5" pos="NCNSN@IS" id="w9-5">besök</w>
</c>
<c id="c-6" type="PP">
<c id="c-7">
<w span="21:1" pos="SPS" id="w9-6" stem="1">i</w>
</c>
<c id="c-8" type="NP">
<w span="23:9" pos="NCUSN@DS" id="w9-7" stem="barndom">barndomen</w>
</c>
</c>
</c>
</s>
Corpus markup (English)
<s lang="en" id="9">
<chunk type="NP" id="c-1">
<w span="0:2" pos="PRP" id="w9-1">It</w>
</chunk>
<chunk type="VP" id="c-2">
<w span="3:2" pos="VBZ" id="w9-2” stem="be">is</w>
</chunk>
<chunk type="NP" id="c-3">
<w span="6:2" pos="PRP$" id="w9-3">my</w>
<w span="9:9" pos="NN" id="w9-4">childhood</w>
</chunk>
<chunk type="VP" id="c-4">
<w span="19:9" pos="VBD" id="w9-5">revisited</w>
</chunk>
<chunk id="c-5">
<w span="28:1" pos="." id="w9-6">.</w>
</chunk>
</s>
… is that all?
How good are the new clues?
Alignment results with learned clues only:
(neither LCSR nor Dice)
clues only   precision   recall    F
POS          55.178%     20.383%   29.769%
position     37.169%     21.550%   27.282%
Download