Catch the Link! Combining Clues for Word Alignment
Jörg Tiedemann, Uppsala University
joerg@stp.ling.uu.se

Outline
Background: What do we want? What do we have? What do we need?
Clue Alignment: What is a clue? How do we find clues? How do we use clues? What do we get?

What do we want?
[diagram: source text and its translations → parallel corpus → sentence aligner → aligned corpus → word aligner → token links and type links; the process should be automatic and language independent]

What do we have?
• tokeniser (ca 99%)
• POS tagger (ca 96%)
• lemmatiser (ca 99%)
• shallow parser (ca 92%), parser (> 80%)
• sentence aligner (ca 96%)
• word aligner (75% precision, 45% recall)

What's the problem with Word Alignment?
Word alignment challenges:
• non-linear mapping
• grammatical/lexical differences
• translation gaps
• translation extensions
• idiomatic expressions
• multi-word equivalences
Examples:
(1) Neutralitetspolitiken stöds av ett starkt försvar till värn för vårt oberoende.
(2) Our policy of neutrality is underpinned by a strong defence.
(The Declarations of the Swedish Government, 1988)
(1) Armén kommer att reformeras och effektiviseras.
(2) The army will be reorganized with the aim of making it more effective.
(The Declarations of the Swedish Government, 1988)
(1) Alsop says, "I have a horror of the bad American practice of choosing up sides in other people's politics, ..."
(2) Alsop förklarar: "Jag fasar för den amerikanska ovanan att välja sida i andra människors politik, ..."
(1) I take the middle seat, which I dislike, but am not really put out.
(2) Jag tar mittplatsen, vilket jag inte tycker om, men gör mig inte så mycket.
(1) Our Hasid is in his late twenties.
(2) Vår chassid är bortåt de trettio.
(Saul Bellow, "To Jerusalem and back: a personal account")

So what? What are the real problems?
Word alignment
• uses simple, fixed tokenisation
• fails to identify appropriate translation units
• ignores contextual dependencies
• ignores relevant linguistic information
• uses poor morphological analyses

What do we need?
• flexible tokenisation
• possible multi-word units
• linguistic tools for several languages
• integration of linguistic knowledge
• combination of knowledge resources
• alignment in context

Let's go! Clue Alignment!
• finding clues
• combining clues
• aligning words

Word Alignment Clues
Example:
[NP The/DT United/NNP Nations/NNP conference/NN] [VP has/VBZ started/VBN] [ADVP today/RB] .
[ADVP Idag/RG0S] [VC började/V@IIAS] [NP FN-konferensen/NCUSN@DS] .

Word Alignment Clues
Def.: A word alignment clue Ci(s,t) is a probability which indicates an association between two lexical items, s and t, from parallel texts.
Def.: A lexical item is a set of words with associated features attached to it.

How do we find clues? (1)
Clues can be estimated from association scores: Ci(s,t) = wi * Ai(s,t)
co-occurrence:
• Dice coefficient: A1(s,t) = Dice(s,t)
• mutual information: A2(s,t) = I(s;t)
string similarity:
• longest common sub-sequence ratio: A3(s,t) = LCSR(s,t)

How do we find clues? (2)
Clues can be estimated from training data: Ci(s,t) = wi * P(ft | fs) ≈ wi * freq(ft, fs) / freq(fs)
fs, ft are features of s and t, e.g.
• part-of-speech sequences of s, t
• phrase category (NP, VP etc.), syntactic function
• word position
• context features
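To make the association-score clues concrete, here is a minimal Python sketch of a Dice-coefficient clue and an LCSR clue. The function names, the counting scheme and the weight parameter are illustrative assumptions, not the original implementation; only the formulas Ci(s,t) = wi * Ai(s,t), Dice and LCSR are taken from the slides.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr_clue(s: str, t: str, weight: float = 1.0, min_len: int = 3) -> float:
    """String-similarity clue: weighted longest common sub-sequence ratio,
    C(s,t) = w * LCS(s,t) / max(|s|,|t|), computed only for strings of >= min_len characters."""
    if len(s) < min_len or len(t) < min_len:
        return 0.0
    return weight * lcs_length(s.lower(), t.lower()) / max(len(s), len(t))

def dice_clue(cooc: int, freq_s: int, freq_t: int, weight: float = 1.0) -> float:
    """Co-occurrence clue: weighted Dice coefficient, C(s,t) = w * 2*f(s,t) / (f(s) + f(t)).
    cooc is the number of aligned segments containing both s and t,
    freq_s and freq_t are the monolingual frequencies of s and t."""
    if freq_s + freq_t == 0:
        return 0.0
    return weight * 2.0 * cooc / (freq_s + freq_t)

# e.g. the string-similarity value used in the clue-overlap example later in the talk:
print(round(lcsr_clue("conference", "FN-konferensen"), 2))   # -> 0.57
```

With suitable weights wi, such scores fill the clue matrix introduced on the following slides.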
How do we use clues? (1)
Clues are simply sets of association measures. The crucial point: we have to combine them!
If Ci(s,t) = P(ai), define the total clue as Call(s,t) = P(A) = P(a1 ∪ a2 ∪ ... ∪ an)
Clues are not mutually exclusive!
P(a1 ∪ a2) = P(a1) + P(a2) - P(a1 ∩ a2)
Assume independence: P(a1 ∩ a2) = P(a1) * P(a2)

How do we use clues? (2)
Clues can refer to any set of tokens from source and target language segments → overlaps, inclusions.
Def.: A clue shares its indication with all member tokens!
→ allows clue combinations at the level of single tokens

Clue overlaps - an example
The United Nations conference has started today.
Idag började FN-konferensen.

Clue 1 (co-occurrence)
United Nations        FN-konferensen   0.4
Nations conference    FN-konferensen   0.5
United                FN-konferensen   0.3

Clue 2 (string similarity)
conference            FN-konferensen   0.57
Nations               FN-konferensen   0.29

Clue_all
United                FN-konferensen   0.58
Nations               FN-konferensen   0.787
conference            FN-konferensen   0.785

The Clue Matrix
[matrix of combined clue scores for all token pairs of "The United Nations conference has started today." / "Idag började FN-konferensen.", built from Clue 1 (co-occurrence) and Clue 2 (string similarity)]

Clue Alignment (1)
general principles:
• combine all clues and fill the matrix
• highest score = best link
• allow overlapping links only
  - if there is no better link for both tokens
  - if the tokens are next to each other
• links which overlap at one point form a link cluster

Clue Alignment (2)
the alignment procedure (see the sketch below):
1. find the best link
2. remove the best link (set its value to 0)
3. check for overlaps
   • accept: add to the set of link clusters
   • dismiss otherwise
4. continue with 1 until no more links are found (or all values are below a certain threshold)

Clue Alignment (3)
[step-by-step best-first extraction on the example clue matrix for "The United Nations conference has started today." / "Idag började FN-konferensen."]
resulting link clusters:
• The United Nations conference ↔ FN-konferensen
• has started ↔ började
• today ↔ idag

Bootstrapping
again: clues can be estimated from training data
• self-training: use available links as training data
• goal: learn new clues for the next step
• risk: increased noise (lower precision)

Learning Clues
POS clue:
• assumption: word pairs with certain POS tags are more likely to be translations of each other than other word pairs
• features: POS-tag sequences
position clue:
• assumption: translations are relatively close to each other (esp. in related languages)
• features: relative word positions

So much for the theory! Results?!
The setup - corpus and basic tools:
• Saul Bellow's "To Jerusalem and back: a personal account", English/Swedish, about 170,000 words
• English POS tagger (Grok), trained on Brown, PTB
• English shallow parser (Grok), trained on PTB
• English stemmer, suffix truncation
• Swedish POS tagger (TnT), trained on SUC
• Swedish CFG parser (Megyesi), rule-based
• Swedish lemmatiser, database taken from SUC

Results!?! … not yet
basic clues:
• Dice coefficient (0.3)
• LCSR (0.4), 3 characters/string
learned clues:
• POS clue
• position clue
clue alignment threshold = 0.4
uniform normalisation (0.5)
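Before turning to the results, here is a minimal sketch of the clue combination and the best-first link extraction described on the Clue Alignment slides above. The data structures and the adjacency test are simplifying assumptions (not the original implementation); the combination under the independence assumption and the greedy procedure follow the slides.

```python
from typing import Dict, List, Tuple

def combine_clues(scores: List[float]) -> float:
    """Combine overlapping clue probabilities under the independence assumption:
    P(a1 ∪ ... ∪ an) = 1 - (1 - P(a1)) * ... * (1 - P(an))."""
    p = 1.0
    for s in scores:
        p *= (1.0 - s)
    return 1.0 - p

def best_first_alignment(matrix: Dict[Tuple[int, int], float],
                         threshold: float = 0.4) -> List[Tuple[int, int]]:
    """Greedy best-first link extraction from a clue matrix.

    matrix maps (source position, target position) to a combined clue score.
    Links are accepted highest-score first; a link that overlaps an already
    accepted one is only kept if it is adjacent to it (a simplification of
    the link-cluster conditions on the slides)."""
    matrix = dict(matrix)            # work on a copy
    links: List[Tuple[int, int]] = []
    while matrix:
        (i, j), score = max(matrix.items(), key=lambda kv: kv[1])
        if score < threshold:
            break
        del matrix[(i, j)]           # "remove the best link"
        overlapping = [(x, y) for (x, y) in links if x == i or y == j]
        adjacent = all(abs(x - i) <= 1 and abs(y - j) <= 1 for (x, y) in overlapping)
        if not overlapping or adjacent:
            links.append((i, j))     # accept: starts or extends a link cluster
        # otherwise dismiss and continue with the next-best link
    return links

# Example: combining the clue scores for "Nations" / "FN-konferensen"
# from the overlap example above (0.4, 0.5 and 0.29):
print(round(combine_clues([0.4, 0.5, 0.29]), 3))   # -> 0.787
```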
Results!!! Come on!

Preliminary results (… work in progress …)
Evaluation: 500 random samples have been linked manually (gold standard)
Metrics: precision_PWA & recall_PWA (Ahrenberg et al., 2000)

alignment & clues         precision   recall    F
---------------------------------------------------------
Dice+LCSR (best-first)    79.377%     32.454%   46.071%
Dice+LCSR                 71.225%     41.065%   52.095%
Dice+LCSR+POS             70.667%     48.566%   57.568%
Dice+LCSR+POS+position    72.820%     51.561%   60.374%

Give me more numbers! The impact of parsing.
How much do we gain? Alignment results with n-grams, (shallow) parsing, and both:

                 precision   recall    F
---------------------------------------------------------
ngrams           74.712%     51.501%   60.972%
chunks           78.410%     52.909%   63.183%
ngrams+chunks    72.820%     51.561%   60.374%

One more thing. Stemming, lemmatisation and all that …
Do we need morphological analyses for Swedish and English?

word/lemma/stem                  precision   recall    F
---------------------------------------------------------
words                            79.490%     48.827%   60.495%
swedish & english stems          77.401%     45.338%   57.181%
swedish lemmas + english stems   78.410%     52.909%   63.183%

Conclusions
Combining clues helps to find links.
Linguistic knowledge helps:
• POS tags are valuable clues
• word position gives hints for related languages
• parsing helps with the segmentation problem
• lemmatisation gives higher recall
We need:
• more experiments, tests with other language pairs, more/other clues
• recall & precision are still low

POS clues - examples
(learned from the links of a previous run; a sketch of the estimation follows below)
score               source        target
---------------------------------------------------------
0.915479582146249   VBZ           V@IPAS
0.91304347826087    WRB           RH0S
0.761904761904762   VBP           V@IPAS
0.701943844492441   RB            RG0S
0.674033149171271   VBD           V@IIAS
0.666666666666667   DT NNP NN     NCUSN@DS
0.647058823529412   PRP VBZ       PF@USS@S V@IPAS
0.625               NNS NNP       NP00N@0S
0.611859838274933   VB            V@N0AS
0.6                 RBR           RGCS
0.5                 DT JJ JJ NN   DF@US@S AQP0SNDS NCUSN@DS

Position clues - examples
score                mapping
------------------------------------
0.245022348638765    x -> 0
0.12541095637398     x -> -1
0.0896900742491966   x -> 1
0.0767611096745595   x -> -2
0.0560378264563555   x -> -3
0.0514572790070555   x -> 2
0.0395256916996047   x -> 6 7 8

Open Questions
• Normalisation: How do we estimate the wi's?
• Non-contiguous phrases: Why not allow long-distance clusters?
• Independence assumption: What is the impact of dependencies?
• Alignment clues: What is a bad clue, what is a good one?
• Contextual clues
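As announced above, here is a rough sketch of how learned clues such as the POS and position clues could be estimated from the links of a previous alignment run, following Ci(s,t) = wi * P(ft | fs) ≈ wi * freq(ft, fs) / freq(fs) from the earlier slides. The function name, data structures and the toy counts are illustrative assumptions, not the original implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def estimate_feature_clue(links: Iterable[Tuple[str, str]],
                          weight: float = 1.0) -> Dict[Tuple[str, str], float]:
    """Estimate a learned clue P(f_t | f_s) ~ freq(f_t, f_s) / freq(f_s)
    from feature pairs of previously linked items.

    links - pairs (f_s, f_t) of source/target features, e.g. POS-tag sequences
            ("DT NNP NN", "NCUSN@DS") or relative positions ("x", "-1").
    Returns a table mapping (f_s, f_t) to a weighted clue value."""
    pair_freq = Counter(links)
    src_freq = Counter(f_s for (f_s, _f_t) in pair_freq.elements())
    return {(f_s, f_t): weight * n / src_freq[f_s]
            for (f_s, f_t), n in pair_freq.items()}

# POS clue with toy counts: two of three links with source tag VBZ map to V@IPAS
pos_clue = estimate_feature_clue([("VBZ", "V@IPAS"), ("VBZ", "V@IPAS"), ("VBZ", "V@N0AS")])
print(round(pos_clue[("VBZ", "V@IPAS")], 3))   # -> 0.667

# A position clue is estimated the same way, with relative word positions as features.
```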
Clue alignment - example
[clue matrix for the sentence pair "amused , my wife asks why i ordered the kosher lunch ." / "road undrar min fru varför jag beställde koscherlunchen ."]

Alignment - examples
the Middle East             Mellersta Östern
afford                      kosta på
at least                    åtminstone
an American satellite       en satellit
common sense                sunda förnuftet
Jerusalem area              Jerusalemområdet
kosher lunch                koscherlunch
leftist anti-Semitism       vänsterantisemitism
left-wing intellectuals     vänsterintellektuella
literary history            litteraturhistoriska
manuscript collection       handskriftsamling
Marine orchestra            marinkårsorkester
marionette theater          marionetteatern
mathematical colleagues     matematikkolleger
mental character            mentalitet
far too                     alldeles

Alignment - examples
a banquet                       en bankett
a battlefield                   ett slagfält
a day                           dagen
the Arab states                 arabstaterna
the Arab world                  arabvärlden
the baggage carousel            bagagekarusellen
the Communist dictatorships     kommunistdiktaturerna
The Fatah terrorists            Al Fatah-terroristerna
the defense minister            försvarsministern
the defense minister            försvarsminister
the daughter                    dotter
the first President             förste president

Alignment - examples
American imperial interests     amerikanska imperialistintressenas
Chicago schools                 Chicagos skolor
decidedly anti-Semitic          avgjort antisemitiska
his identity                    sin identitet
his interest                    sitt intresse
his interviewer                 hans intervjuare
militant Islam                  militanta muhammedanismen
no longer                       inte längre
sophisticated arms              avancerade vapen
still clearly                   uppenbarligen ännu
dozen Russian                   dussin ryska
exceedingly intelligent         utomordentligt intelligent
few drinks                      några drinkar
goyish democracy                gojernas demokrati
industrialized countries        industrialiserade länderna
has become                      har blivit

Gold standard - MWUs
link:        Secretary of State -> Utrikesminister
link type:   regular
unit type:   multi -> single
source text: Secretary of State Henry Kissinger has won the Middle Eastern struggle by drawing Egypt into the American camp.
target text: Utrikesminister Henry Kissinger har vunnit slaget om Mellanöstern genom att dra in Egypten i det amerikanska lägret.

Gold standard - fuzzy links
link:        unrelated -> inte tillhör hans släkt
link type:   fuzzy
unit type:   single -> multi
source text: And though he is not permitted to sit beside women unrelated to him or to look at them or to communicate with them in any manner (all of which probably saves him a great deal of trouble), he seems a good-hearted young man and he is visibly enjoying himself.
target text: Och fastän han inte får sitta bredvid kvinnor som inte tillhör hans släkt eller se på dem eller meddela sig med dem på något sätt (alltsammans saker som utan tvivel besparar honom en mängd bekymmer) verkar han vara en godhjärtad ung man, och han ser ut att trivas gott.

Gold standard - null links
link:        do -> null
link type:   null
unit type:   single -> null
source text: "How is it that you do not know English?"
target text: "Hur kommer det sig att ni inte talar engelska?"

Gold standard - morphology
link:        the masses -> massorna
link type:   regular
unit type:   multi -> single
source text: Arafat was unable to complete the classic guerrilla pattern and bring the masses into the struggle.
target text: Arafat har inte kunnat fullborda det klassiska gerillamönstret och föra in massorna i kampen.

Evaluation metrics
Q = (Csrc + Ctrg) / (max(Ssrc, Gsrc) + max(Strg, Gtrg))
precision_PWA = ΣQ / (n(I) + n(P) + n(C))
recall_PWA    = ΣQ / (n(I) + n(P) + n(C) + n(M))
Csrc - number of overlapping source tokens in (partially) correct link proposals, Csrc = 0 for incorrect link proposals
Ctrg - number of overlapping target tokens in (partially) correct link proposals, Ctrg = 0 for incorrect link proposals
Ssrc - number of source tokens proposed by the system
Strg - number of target tokens proposed by the system
Gsrc - number of source tokens in the gold standard
Gtrg - number of target tokens in the gold standard
n(C), n(P), n(I) - number of correct, partially correct and incorrect link proposals; n(M) - number of gold-standard links missed by the system; ΣQ = sum of Q over all link proposals
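A small sketch of these metrics, assuming each link is represented as a pair of source and target token sets. Names are illustrative, and judging a proposal "incorrect" is simplified here to "no token overlap with the gold link", which is an assumption rather than the original evaluation procedure.

```python
from typing import List, Set, Tuple

Link = Tuple[Set[str], Set[str]]   # (source tokens, target tokens)

def q_score(proposed: Link, gold: Link) -> float:
    """Per-link quality Q = (Csrc + Ctrg) / (max(Ssrc,Gsrc) + max(Strg,Gtrg))."""
    s_src, s_trg = proposed
    g_src, g_trg = gold
    c_src = len(s_src & g_src)
    c_trg = len(s_trg & g_trg)
    if c_src == 0 and c_trg == 0:          # treated as an incorrect proposal
        return 0.0
    return (c_src + c_trg) / (max(len(s_src), len(g_src)) + max(len(s_trg), len(g_trg)))

def pwa_scores(pairs: List[Tuple[Link, Link]], n_missing: int) -> Tuple[float, float]:
    """precision_PWA and recall_PWA over (proposed, gold) link pairs.

    pairs     - one entry per system proposal (correct, partial or incorrect)
    n_missing - gold-standard links for which the system proposed nothing"""
    q_sum = sum(q_score(p, g) for p, g in pairs)
    precision = q_sum / len(pairs)
    recall = q_sum / (len(pairs) + n_missing)
    return precision, recall

# e.g. gold "Scanias chassier -> Scania chassis" vs proposed "Scanias -> Scania chassis"
print(q_score(({"Scanias"}, {"Scania", "chassis"}),
              ({"Scanias", "chassier"}, {"Scania", "chassis"})))   # -> 0.75
```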
Evaluation metrics - example

reference                                    proposed                                           Q
---------------------------------------------------------------------------------------------------------------------------------
Reläventil TC -> TC relay valve              Reläventil TC -> Relay valve TC                    (3/5 = 0.6) + (2/5 = 0.4) = 1
ordinarie -> ordinary                        ordinarie skruv -> ordinary                        2/3 ≈ 0.66
kommer att indikeras -> will be indicated    det kommer att indikeras -> will … the indicated   (2/7 ≈ 0.286) + (0/7 = 0) + (2/7 ≈ 0.286) ≈ 0.57
vill -> wants                                (no link proposed)                                 0
vatten -> …                                  …                                                  1
till -> to                                   att -> to                                          0
Scanias chassier -> Scania chassis           Scanias -> Scania chassis                          3/4 = 0.75

precision_PWA = ΣQ / 6 proposed links ≈ 0.663
recall_PWA    = ΣQ / 7 reference links ≈ 0.569

Corpus markup (Swedish)
<s lang="sv" id="9">
  <c id="c-1" type="NP">
    <w span="0:3" pos="PF@NS0@S" id="w9-1" stem="det">Det</w>
  </c>
  <c id="c-2" type="VC">
    <w span="4:2" pos="V@IPAS" id="w9-2" stem="vara">är</w>
  </c>
  <c id="c-3">
    <w span="7:3" pos="CCS" id="w9-3" stem="som">som</w>
  </c>
  <c id="c-4" type="NPMAX">
    <c id="c-5" type="NP">
      <w span="11:3" pos="DI@NS@S" id="w9-4" stem="en">ett</w>
      <w span="15:5" pos="NCNSN@IS" id="w9-5">besök</w>
    </c>
    <c id="c-6" type="PP">
      <c id="c-7">
        <w span="21:1" pos="SPS" id="w9-6" stem="i">i</w>
      </c>
      <c id="c-8" type="NP">
        <w span="23:9" pos="NCUSN@DS" id="w9-7" stem="barndom">barndomen</w>
      </c>
    </c>
  </c>
</s>

Corpus markup (English)
<s lang="en" id="9">
  <chunk type="NP" id="c-1">
    <w span="0:2" pos="PRP" id="w9-1">It</w>
  </chunk>
  <chunk type="VP" id="c-2">
    <w span="3:2" pos="VBZ" id="w9-2" stem="be">is</w>
  </chunk>
  <chunk type="NP" id="c-3">
    <w span="6:2" pos="PRP$" id="w9-3">my</w>
    <w span="9:9" pos="NN" id="w9-4">childhood</w>
  </chunk>
  <chunk type="VP" id="c-4">
    <w span="19:9" pos="VBD" id="w9-5">revisited</w>
  </chunk>
  <chunk id="c-5">
    <w span="28:1" pos="." id="w9-6">.</w>
  </chunk>
</s>
(a sketch of how this markup can be read is given at the end)

… is that all? How good are the new clues?
Alignment results with learned clues only (neither LCSR nor Dice):

clues only   precision   recall    F
---------------------------------------------------------
POS          55.178%     20.383%   29.769%
position     37.169%     21.550%   27.282%
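Finally, as announced above, a minimal sketch of reading the corpus markup into lexical items with features (word form, stem, POS tag, chunk type), the input the clue computations operate on. The element and attribute names follow the samples above; the function itself is an illustrative assumption, not the original tool.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def read_sentence(xml_string: str) -> List[Dict[str, str]]:
    """Collect the <w> tokens of one <s> element together with their features.

    Works for both the Swedish (<c>) and the English (<chunk>) markup,
    since it only looks at <w> elements and their enclosing element's type."""
    sentence = ET.fromstring(xml_string)
    tokens = []
    for chunk in sentence.iter():
        if chunk.tag not in ("c", "chunk"):
            continue
        for w in chunk.findall("w"):
            tokens.append({
                "word": w.text,
                "id": w.get("id"),
                "pos": w.get("pos"),
                "stem": w.get("stem", w.text),   # fall back to the word form
                "chunk": chunk.get("type", ""),
            })
    return tokens

# Example (abbreviated English sample from the slide above):
sample = ('<s lang="en" id="9"><chunk type="NP" id="c-1">'
          '<w span="0:2" pos="PRP" id="w9-1">It</w></chunk></s>')
for tok in read_sentence(sample):
    print(tok["word"], tok["pos"], tok["chunk"])   # It PRP NP
```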