Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System

Alon Lavie
Language Technologies Institute, Carnegie Mellon University

Joint work with:
Shuly Wintner, Danny Shacham, Nurit Melnik, Yuval Krymolowski (University of Haifa)
Erik Peterson (Carnegie Mellon University)

June 20, 2007 ISCOL/BISFAI-2007

Outline
• Context of this Work
• CMU Statistical Transfer MT Framework
• Hebrew and its Challenges for MT
• Hebrew-to-English System
• Morphological Analysis and Generation
• MT Resources: lexicon and grammar
• Translation Examples
• Performance Evaluation
• Conclusions, Current and Future Work

Current State-of-the-art in Machine Translation
• MT underwent a major paradigm shift over the past 15 years:
  – From manually crafted rule-based systems with manually designed knowledge resources
  – To search-based approaches founded on automatic extraction of translation models/units from large sentence-parallel corpora
• Current Dominant Approach: Phrase-based Statistical MT:
  – Extract and statistically model large volumes of phrase-to-phrase correspondences from automatically word-aligned parallel corpora
  – “Decode” new input by searching for the most likely sequence of phrase matches, using a statistical Language Model for the target language
• Phrase-based MT State-of-the-art:
  – Requires at least several million words of parallel text for adequate training
  – Limited to language pairs for which such data exists: major European languages, Chinese, Japanese, a few others…
  – Linguistically shallow and highly lexicalized models result in weak generalization
  – Best performance levels (BLEU=~0.6) on Arabic-to-English provide understandable but often still somewhat disfluent translations
  – Ill-suited for Hebrew and most of the world’s minor languages

CMU’s Statistical-Transfer (XFER) Approach
• Framework: Statistical search-based approach
with syntactic translation transfer rules that can be acquired from data but also developed and extended by experts
• Elicitation: use bilingual native informants to produce a small high-quality word-aligned bilingual corpus of translated phrases and sentences
• Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
• XFER + Decoder:
  – XFER engine produces a lattice of possible transferred structures at all levels
  – Decoder searches and selects the best scoring combination
• Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
• Word and Phrase bilingual lexicon acquisition

[Diagram: XFER system pipeline]
Hebrew Input → Preprocessing → Morphology → Transfer Engine → Translation Output Lattice → Decoder → English Output
(The Transfer Engine draws on the Transfer Rules and the Translation Lexicon; the Decoder draws on the English Language Model.)
• Hebrew Input: בשורה הבאה
• Transfer Rules (example):
    {NP1,3}
    NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
    (
     (X3::Y1) (X1::Y2)
     ((X1 def) = +)
     ((X1 status) =c absolute)
     ((X1 num) = (X3 num))
     ((X1 gen) = (X3 gen))
     (X0 = X1)
    )
• Translation Lexicon (examples):
    N::N |: ["$WR"] -> ["BULL"]
    ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
    N::N |: ["$WRH"] -> ["LINE"]
    ((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "LINE"))
• Translation Output Lattice (excerpt):
    (0 1 "IN" @PREP)
    (1 1 "THE" @DET)
    (2 2 "LINE" @N)
    (1 2 "THE LINE" @NP)
    (0 2 "IN LINE" @PP)
    (0 4 "IN THE NEXT LINE" @PP)
• English Output: in the next line

Transfer Rule Formalism
• A rule combines type information, part-of-speech/constituent information, alignments, x-side constraints, y-side constraints and xy-constraints
• Example:
    ;SL: the old man, TL: ha-ish ha-zaqen
    NP::NP [DET ADJ N] -> [DET N DET ADJ]    ; type information, then part-of-speech/constituent information
    (
     (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)     ; alignments
     ((X1 AGR) = *3-SING)                    ; x-side constraints
     ((X1 DEF) = *DEF)
     ((X3 AGR) = *3-SING)
     ((X3 COUNT) = +)
     ((Y1 DEF) = *DEF)                       ; y-side constraints
     ((Y3 DEF) = *DEF)
     ((Y2 AGR) = *3-SING)
     ((Y2 GENDER) = (Y4 GENDER))
    )
• xy-constraints relate the two sides, e.g. ((Y1 AGR) = (X1 AGR))

The Transfer Engine
• Main algorithm: chart-style bottom-up integrated parsing+transfer with beam pruning
  – Seeded by word-to-word translations
  – Driven by transfer rules
  – Generates a lattice of transferred translation segments at all levels
• Some Unique Features:
  – Works with either learned or manually-developed transfer grammars
  – Handles rules with or without unification constraints
  – Supports interfacing with servers for morphological analysis and generation
  – Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures

XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')")
(29 29 "SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE')) ")
(29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0 (ADV,6 'SINCE THEN')) ")
(29 29 "EVER SINCE" -12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE')) ")
(30 30 "WORKED" -10.9913 "&BD " "(VERB,0 (V,11 'WORKED')) ")
(30 30 "FUNCTIONED" -16.0023 "&BD " "(VERB,0 (V,10 'FUNCTIONED')) ")
(30 30 "WORSHIPPED" -17.3393 "&BD " "(VERB,0 (V,12 'WORSHIPPED')) ")
(30 30 "SERVED" -11.5161 "&BD " "(VERB,0 (V,14 'SERVED')) ")
(30 30 "SLAVE" -13.9523 "&BD " "(NP0,0 (N,34 'SLAVE')) ")
(30 30 "BONDSMAN" -18.0325 "&BD " "(NP0,0 (N,36 'BONDSMAN')) ")
(30 30 "A SLAVE" -16.8671 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,34 'SLAVE')) ) ) ) ")
(30 30 "A BONDSMAN" -21.0649 "&BD " "(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")

The Lattice Decoder
• Simple Stack Decoder, similar in principle to simple Statistical MT decoders
• Searches for best-scoring path of non-overlapping lattice arcs
• No reordering during decoding
• Scoring based on log-linear combination of scoring components, with weights trained using MERT
• Scoring
components:
  – Statistical Language Model
  – Fragmentation: how many arcs to cover the entire translation?
  – Length Penalty
  – Rule Scores
  – Lexical Probabilities (not fully integrated)

XFER Lattice Decoder
00 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
235 < 0 8 -19.7602: B H IWM RBI&I (PP,0 (PREP,3 'ON')(NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH'))(NP1,0 (NP0,1 (N,6 'DAY')))))))>
918 < 8 14 -46.2973: H ARIH AKL AT H $PN (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION')))))(VERB,0 (V,0 'ATE'))(NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT')))))))>
584 < 14 17 -30.6607: L ARWXH BWQR (PP,0 (PREP,6 'TO')(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING'))(NP0,0 (N,27 'MEAL')))))))>

XFER MT Prototypes
• General XFER framework under development for past five years
• Prototype systems so far:
  – German-to-English
  – Dutch-to-English
  – Chinese-to-English
  – Hindi-to-English
  – Hebrew-to-English
• In progress or planned:
  – Mapudungun-to-Spanish
  – Quechua-to-Spanish
  – Brazilian Portuguese-to-English
  – Native-Brazilian languages to Brazilian Portuguese
  – Hebrew-to-Arabic

Challenges for Hebrew MT
• Paucity of existing language resources for Hebrew
  – No publicly available broad coverage morphological analyzer
  – No publicly available bilingual lexicons or dictionaries
  – No POS-tagged corpus or parse tree-bank corpus for Hebrew
  – No large Hebrew/English parallel corpus
• Scenario well suited for CMU transfer-based MT framework for languages with limited resources

Modern Hebrew Spelling
• Two main spelling variants
  – “KTIV XASER” (deficient): words are spelled with vowel diacritics, and only the consonants remain when the diacritics are removed
  – “KTIV MALEH” (full): words with I/O/U vowels are written
with long vowels which include a letter
• KTIV MALEH is predominant, but not strictly adhered to even in newspapers and official publications, resulting in inconsistent spelling
• Example:
  – niqud (spelling): NIQWD, NQWD, NQD
  – When written as NQD, could also be niqed, naqed, nuqad

Morphological Analyzer
• We use a publicly available morphological analyzer distributed by the Technion’s Knowledge Center, adapted for our system
• Coverage is reasonable (for nouns, verbs and adjectives)
• Produces all analyses or a disambiguated analysis for each word
• Output format includes lexeme (base form), POS, morphological features
• Output was adapted to our representation needs (POS and feature mappings)

Morphology Example
• Input word: B$WRH
    0     1     2     3     4
    |--------B$WRH--------|
    |-----B-----|$WR|--H--|
    |--B--|-H--|--$WRH---|
• Analyses:
    Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
    Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
    Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
    Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
    Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
    Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
    Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
    Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Translation Lexicon
• Constructed our own Hebrew-to-English lexicon, based primarily on the existing “Dahan” H-to-E and E-to-H dictionary made available to us, augmented by other public sources
• Coverage is not great but not bad as a start
  – Dahan H-to-E is about 15K translation pairs
  – Dahan E-to-H is about 7K translation pairs
• Base forms, POS information on both sides
• Converted Dahan into our representation, added entries for missing closed-class entries (pronouns, prepositions, etc.)
• Had to deal with spelling conventions
• Recently augmented with ~50K translation pairs extracted from Wikipedia (mostly proper names and named entities)

Manual Transfer Grammar (human-developed)
• Initially developed by Alon in a couple of days, extended and revised by Nurit over time
• Current grammar has 36 rules:
  – 21 NP rules
  – one PP rule
  – 6 verb complexes and VP rules
  – 8 higher-phrase and sentence-level rules
• Captures the most common (mostly local) structural differences between Hebrew and English

Transfer Grammar Example Rules
    {NP1,2}
    ;;SL: $MLH ADWMH
    ;;TL: A RED DRESS
    NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
    (
     (X2::Y1) (X1::Y2)
     ((X1 def) = -)
     ((X1 status) =c absolute)
     ((X1 num) = (X2 num))
     ((X1 gen) = (X2 gen))
     (X0 = X1)
    )

    {NP1,3}
    ;;SL: H $MLWT H ADWMWT
    ;;TL: THE RED DRESSES
    NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
    (
     (X3::Y1) (X1::Y2)
     ((X1 def) = +)
     ((X1 status) =c absolute)
     ((X1 num) = (X3 num))
     ((X1 gen) = (X3 gen))
     (X0 = X1)
    )

Hebrew-to-English MT Prototype
• Initial prototype developed within a two-month intensive effort
• Accomplished:
  – Adapted available morphological analyzer
  – Constructed a preliminary translation lexicon
  – Translated and aligned Elicitation Corpus
  – Learned XFER rules
  – Developed (small) manual XFER grammar
  – System debugging and development
  – Evaluated performance on unseen test data using automatic evaluation metrics

Example Translation
• Input:
  – לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
  – After debates many decided the government to hold referendum in issue the withdrawal
• Output:
  – AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

Noun Phrases – Construct State
החלטת הנשיא הראשון
HXL@T [HNSIA HRA$WN]
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT

החלטת הנשיא הראשונה
[HXL@T HNSIA] HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT

Noun Phrases - Possessives
הנשיא הכריז שהמשימה הראשונה שלו תהיה למצוא פתרון לסכסוך באזורנו
HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
LMCWA PTRWN LSKSWK BAZWRNW
to-find solution to-the-conflict in-region-POSS.1P
Without transfer grammar:
THE PRESIDENT ANNOUNCED THAT THE TASK THE BEST OF HIM WILL BE TO FIND SOLUTION TO THE CONFLICT IN REGION OUR
With transfer grammar:
THE PRESIDENT ANNOUNCED THAT HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE CONFLICT IN OUR REGION

Subject-Verb Inversion
אתמול הודיעה הממשלה שתערכנה בחירות בחודש הבא
ATMWL HWDI&H HMM$LH
yesterday announced.3SF the-government.3SF
$T&RKNH BXIRWT BXWD$ HBA
that-will-be-held.3PF elections.3PF in-the-month the-next
Without transfer grammar:
YESTERDAY ANNOUNCED THE GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE MONTH THE NEXT
With transfer grammar:
YESTERDAY THE GOVERNMENT ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT MONTH

Subject-Verb Inversion
לפני כמה שבועות הודיעה הנהלת המלון שהמלון יסגר בסוף השנה
LPNI KMH $BW&WT HWDI&H HNHLT HMLWN
before several weeks announced.3SF management.3SF.CS the-hotel
$HMLWN ISGR BSWF H$NH
that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year
Without transfer grammar:
IN FRONT OF A FEW WEEKS ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL WILL CLOSE AT THE END THIS YEAR
With transfer grammar:
SEVERAL WEEKS AGO THE MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL WILL CLOSE AT THE END OF THE YEAR

Evaluation Results
• Test set of 62 sentences from Haaretz newspaper, with 2 reference translations

    System     BLEU     NIST     Precision   Recall   METEOR
    No Gram    0.0616   3.4109   0.4090      0.4427   0.3298
    Learned    0.0774   3.5451   0.4189      0.4488   0.3478
    Manual     0.1026   3.7789   0.4334      0.4474   0.3617

Current and Future Work
• Issues specific to the Hebrew-to-English system:
  – Coverage: further improvements in the translation lexicon and morphological analyzer
  – Manual Grammar development
  – Acquiring/training of word-to-word translation probabilities
  – Acquiring/training of a Hebrew language model at a post-morphology level that can help with disambiguation
• General Issues related to XFER framework:
  – Discriminative Language Modeling for MT
  – Effective models for assigning scores to transfer rules
  – Improved grammar learning
  – Merging/integration of manual and acquired grammars

Conclusions
• Test case for the CMU XFER framework for rapid MT prototyping
• Preliminary system was a two-month, three-person effort; we were quite happy with the outcome
• Core concept of XFER + Decoding is very powerful and promising for MT
• We experienced the main bottlenecks of knowledge acquisition for MT: morphology, translation lexicons, grammar...

Questions?
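Illustration: Lattice Decoding Sketch
• The Lattice Decoder slide describes a search for the best-scoring path of non-overlapping lattice arcs, with no reordering and a fragmentation component in the score. The sketch below is one possible way to picture that search, not the decoder used in the system: it is a plain dynamic program rather than a stack decoder, and it folds all scoring components into one number per arc plus a per-arc fragmentation penalty.
• Assumptions (none of these come from the slides): the names Arc and decode, the half-open span convention, the frag_penalty parameter, and every score value are hypothetical.

```python
from typing import NamedTuple, Optional


class Arc(NamedTuple):
    start: int    # first source position covered (inclusive)
    end: int      # position just past the last one covered (exclusive)
    text: str     # English fragment proposed for this span
    score: float  # log-score of the fragment (hypothetical values below)


def decode(arcs: list[Arc], length: int,
           frag_penalty: float = 1.0) -> Optional[tuple[float, list[str]]]:
    """Return (score, fragments) for the best monotone path of
    non-overlapping arcs covering positions [0, length), or None."""
    # best[p] holds the best partial path ending exactly at position p.
    best: dict[int, tuple[float, list[str]]] = {0: (0.0, [])}
    for pos in range(length):
        if pos not in best:
            continue  # no path reaches this position
        score, frags = best[pos]
        for arc in arcs:
            if arc.start != pos:
                continue
            # Each extra arc costs frag_penalty, so fewer, longer arcs win ties.
            cand = score + arc.score - frag_penalty
            if arc.end not in best or cand > best[arc.end][0]:
                best[arc.end] = (cand, frags + [arc.text])
    return best.get(length)


# Toy lattice loosely modelled on the "in the next line" example.
arcs = [
    Arc(0, 1, "IN", -2.0),
    Arc(1, 2, "THE", -1.0),
    Arc(2, 3, "LINE", -3.0),
    Arc(2, 3, "BULL", -6.0),
    Arc(3, 4, "NEXT", -3.0),
    Arc(1, 3, "THE LINE", -3.5),
    Arc(0, 4, "IN THE NEXT LINE", -7.0),
]

print(decode(arcs, 4))  # (-8.0, ['IN THE NEXT LINE'])
```

• In this toy lattice the single arc produced by a transfer rule beats every word-by-word path, which mirrors the role the fragmentation component plays in the decoder's scoring. Without that arc, the best path is IN + THE LINE + NEXT, which stays in source order because the search never reorders arcs.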