Lecture 8: Statistical Alignment & Machine Translation
(Chapter 13 of Manning & Schütze)

Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering, National Cheng Kung University
2014/05/12
(Slides from Berlin Chen, National Taiwan Normal University,
http://140.122.185.120/PastCourses/2004SNaturalLanguageProcessing/NLP_main_2004S.htm)

References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Mercer, R. L. The Mathematics of Statistical Machine Translation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.

Machine Translation (MT)
• Definition
  – Automatic translation of text or speech from one language to another
• Goal
  – Produce close to error-free output that reads fluently in the target language
  – Current systems are still far from this goal
• Current status
  – Existing systems seem limited in quality
    • Babel Fish Translation (before 2004)
    • Google Translation
  – Typically a mix of probabilistic and non-probabilistic components

Issues
• Build high-quality, semantics-based MT systems in circumscribed domains
• Abandon fully automatic MT and build software to assist human translators instead
  – Post-edit the output of an imperfect translation system
• Develop automatic knowledge-acquisition techniques for improving general-purpose MT
  – Supervised or unsupervised learning

Different Strategies for MT
(the classic transfer pyramid between English and French)
• Interlingua (knowledge representation): knowledge-based translation
• English (semantic representation) ↔ semantic transfer ↔ French (semantic representation)
• English (syntactic parse) ↔ syntactic transfer ↔ French (syntactic parse)
• English text (word string) ↔ word-for-word ↔ French text (word string)

Word-for-Word MT
• Translate words one by one from one language to another
• Problems
  1. There is no one-to-one correspondence between words in different languages (lexical ambiguity)
     – Need to look at context larger than the individual word (phrase or clause)
     – E.g., English "suit" has two distinct meanings, a lawsuit vs. a set of garments, which translate to different French words (see the toy sketch below)
  2. Languages have different word orders
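To make the lexical-ambiguity problem concrete, here is a minimal word-for-word translator in Python. The tiny sense-ambiguous lexicon and the French candidates are hypothetical, chosen only to show why context-free, one-word-at-a-time lookup fails.

# A toy word-for-word translator. The dictionary is hypothetical and
# deliberately tiny; real lexicons map one word to many candidates.
LEXICON = {
    "suit": ["procès", "costume"],  # lawsuit vs. set of garments
    "the": ["le"],
    "fits": ["va"],
}

def word_for_word(sentence):
    """Translate each word independently, ignoring context."""
    out = []
    for word in sentence.lower().split():
        candidates = LEXICON.get(word, [word])  # pass unknown words through
        out.append(candidates[0])  # naive: always pick the first sense
    return " ".join(out)

# "suit" here means a garment, but the naive picker chooses the
# lawsuit sense -- exactly the ambiguity problem described above.
print(word_for_word("the suit fits"))

Running it prints "le procès va": the garment sense was intended, but the first-sense choice picks the lawsuit sense, since nothing larger than the single word is consulted.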
Syntactic Transfer MT
• Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree
  – Solves the word-ordering problem
  – Problems
    • Syntactic ambiguity
    • The target syntax will likely mirror that of the source text
      – E.g., German "Ich esse gern" (N V Adv, "I like to eat") transfers literally into English as "I eat readily/gladly"

Text Alignment: Definition
• Definition
  – Align paragraphs, sentences, or words in one language to paragraphs, sentences, or words in another language
  – From such alignments we can learn which words tend to be translated by which other words in the other language (bilingual dictionaries, MT, parallel grammars, ...)
  – Not part of the MT process per se, but the obligatory first step for making use of multilingual text corpora
• Applications
  – Bilingual lexicography
  – Machine translation
  – Multilingual information retrieval
  – ...

Text Alignment: Sources and Granularities
• Sources of parallel texts (bitexts)
  – Parliamentary proceedings (Hansards)
  – Newspapers and magazines
  – Religious and literary works, often with less literal translation
• Two levels of alignment
  – Gross large-scale alignment
    • Learn which paragraphs or sentences correspond to which paragraphs or sentences in the other language
  – Word alignment
    • Learn which words tend to be translated by which words in the other language
    • The necessary step for acquiring a bilingual dictionary
• The order of words or sentences might not be preserved across languages.

Text Alignment: Example 1
(figure: a 2:2 sentence alignment between two texts)

Text Alignment: Example 2
(figure: a sentence alignment consisting of 2:2, 1:1, 1:1, and 2:1 groups; each aligned group is called a bead)
• Studies show that around 90% of sentence alignments are 1:1.

Sentence Alignment
• Crossing dependencies are not allowed here
  – The order of sentences is assumed to be preserved!
• Related work falls into three families
  – Length-based
  – Lexicon-guided
  – Offset-based

Sentence Alignment: Length-Based Method
• Rationale: short sentences tend to be translated as short sentences, and long sentences as long sentences
  – Length is defined as the number of words or the number of characters
• Approach 1 (Gale & Church 1993), developed on the Union Bank of Switzerland (UBS) corpus: English, French, and German
  – Assumptions
    • The paragraph structure is clearly marked in the corpus; confusions are checked by hand
    • Lengths of sentences are measured in characters
    • Crossing dependencies are not handled: the order of the sentences s1 s2 ... sI and t1 t2 ... tJ is not changed in translation
  – The method ignores the rich lexical information available in the text.

Sentence Alignment: Length-Based Method (Cont.)
• Source text S = s1 s2 ... sI and target text T = t1 t2 ... tJ are divided into aligned groups of sentences called beads B1, B2, ..., BK
  – Possible bead types: {1:1, 1:0, 0:1, 2:1, 1:2, 2:2, ...}
• Assuming independence between beads, the best alignment A = (B1, B2, ..., BK) is
  arg max_A P(A | S, T) = arg max_A P(A, S, T) ≈ arg max_A ∏_{k=1}^{K} P(B_k)

Sentence Alignment: Length-Based Method — Dynamic Programming
• The cost function (distance measure), by Bayes' law:
  cost(align(l1, l2)) = −log P(align | l1, l2; c, s²) ∝ −log P(align) − log P(l1, l2 | align; c, s²)
  – l1, l2: the character lengths of the two sentence groups
  – c: the expected ratio of target-text length to source-text length; s²: the variance of the squared length difference between aligned paragraphs
• Define δ = (l2 − l1·c) / sqrt(l1·s²); given a correct alignment, δ is assumed to follow the standard normal distribution, so
  P(l1, l2 | align) ≈ 2·(1 − Φ(|δ|)),
  where Φ is the cumulative distribution function of the standard normal distribution
• The sentence is the unit of alignment; character lengths are modeled statistically
• P(align) is the prior probability of each bead type, e.g., P(1:1 align), with 1:1 by far the most likely
• Dynamic programming: let D(i, j) be the lowest total cost of aligning s1 ... si with t1 ... tj; then
  D(i, j) = min {
    D(i, j−1) + cost(0:1 align(∅, tj)),
    D(i−1, j) + cost(1:0 align(si, ∅)),
    D(i−1, j−1) + cost(1:1 align(si, tj)),
    D(i−1, j−2) + cost(1:2 align(si, tj−1, tj)),
    D(i−2, j−1) + cost(2:1 align(si−1, si, tj)),
    D(i−2, j−2) + cost(2:2 align(si−1, si, tj−1, tj)) }
  (a runnable sketch of this cost function and recurrence follows below)
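A minimal Python sketch of the length-based aligner. The bead priors and the parameters c = 1.0 and s² = 6.8 are in the spirit of the values Gale & Church (1993) report for their corpus; treat the whole thing as an illustrative sketch rather than a faithful reimplementation.

import math

# Bead priors in the spirit of Gale & Church (1993); illustrative defaults.
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C, S2 = 1.0, 6.8  # length ratio c and variance s^2; illustrative values

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cost(l1, l2, bead):
    """-log P(bead) - log P(delta | align), lengths in characters."""
    if l1 == 0 and l2 == 0:
        return 0.0
    delta = (l2 - l1 * C) / math.sqrt(max(l1, 1) * S2)
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-12)
    return -math.log(PRIOR[bead]) - math.log(p_delta)

def align(src_lens, tgt_lens):
    """DP over sentence-length sequences; returns the best bead sequence."""
    I, J = len(src_lens), len(tgt_lens)
    D = {(0, 0): (0.0, None)}
    for i in range(I + 1):
        for j in range(J + 1):
            if (i, j) == (0, 0):
                continue
            best = (math.inf, None)
            for (di, dj) in PRIOR:
                if i - di < 0 or j - dj < 0:
                    continue
                l1 = sum(src_lens[i - di:i])
                l2 = sum(tgt_lens[j - dj:j])
                total = D[(i - di, j - dj)][0] + cost(l1, l2, (di, dj))
                if total < best[0]:
                    best = (total, (di, dj))
            D[(i, j)] = best
    # Trace back the chosen beads.
    beads, i, j = [], I, J
    while (i, j) != (0, 0):
        di, dj = D[(i, j)][1]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return list(reversed(beads))

print(align([40, 55, 20], [42, 30, 28, 19]))  # -> [(1, 1), (1, 2), (1, 1)]

The backtrace returns the bead sequence; here the middle source sentence is aligned 1:2 because its length roughly matches the two middle target sentences combined.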
Sentence Alignment: Length-Based Method — A Simple Example
• Two candidate alignments of s1 ... s4 with t1 ... t3 compared by total cost:
  – Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
  – Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))

Sentence Alignment: Length-Based Method — Results
• A 4% error rate was achieved
• Problems
  – Cannot handle noisy and imperfect input
    • E.g., OCR output, or files containing unknown markup conventions
    • Finding paragraph or sentence boundaries is difficult
    • Solution: just align text-position offsets in the two parallel texts (Church 1993)
  – Questionable for languages with few cognates or different writing systems
    • E.g., English ↔ Chinese, Eastern European languages ↔ Asian languages

Sentence Alignment: Lexical Method
• Approach 1 (Kay & Röscheisen 1993)
  – First assume the first and last sentences of the texts are aligned; these serve as the initial anchors
  – Form an envelope of possible alignments
    • Alignments are excluded when their sentences cross anchors or when their respective distances from an anchor differ greatly
  – Choose word pairs whose distributions are similar over most of the sentences
  – Find pairs of source and target sentences that contain many possible lexical correspondences
    • The most reliable pairs are used to induce a set of partial alignments (added to the list of anchors)
  – Iterate

Word Alignment
• The sentence/offset alignment can be extended to a word alignment
• Some criteria are then used to select aligned word pairs to include in the bilingual dictionary
  – Frequency of word correspondences
  – Association measures
  – ...

Statistical Machine Translation
• The noisy channel model consists of a language model P(e), a translation model P(f | e), and a decoder
  ê = arg max_e P(e | f) = arg max_e P(e)·P(f | e)
  (e: an English sentence of length |e| = l; f: a French sentence of length |f| = m)
• Assumptions
  – An English word can be aligned with multiple French words, while each French word is aligned with at most one English word
  – The individual word-to-word translations are independent

Statistical Machine Translation (Cont.)
• Three important components involved
  – Decoder: ê = arg max_e P(e | f) = arg max_e P(e)·P(f | e) / P(f) = arg max_e P(e)·P(f | e)
  – Language model: gives the probability P(e)
  – Translation model: sums over all possible alignments a = (a1, ..., am),
    P(f | e) = (1/Z) Σ_{a1=0}^{l} ... Σ_{am=0}^{l} ∏_{j=1}^{m} P(fj | e_{aj})
    where e_{aj} is the English word that the French word fj is aligned with, P(fj | e_{aj}) is the word-translation probability, and Z is a normalization constant
    (a sketch of this computation follows below)
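Here is a minimal sketch of the translation-model computation in its IBM Model 1 form, where the normalization constant is ε/(l+1)^m. The toy t-table is hypothetical. The second function uses the standard rearrangement of the sum over alignments into a product of per-word sums, which is what makes the computation tractable.

import itertools

# Hypothetical word-translation probabilities t(f | e).
T = {("le", "the"): 0.7, ("chien", "the"): 0.0,
     ("le", "dog"): 0.05, ("chien", "dog"): 0.8,
     ("le", "NULL"): 0.2, ("chien", "NULL"): 0.01}

def p_f_given_e_naive(f, e, eps=1.0):
    """Direct sum over all (l+1)^m alignments (NULL in position 0)."""
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    total = 0.0
    for a in itertools.product(range(l + 1), repeat=m):
        p = 1.0
        for j, fj in enumerate(f):
            p *= T.get((fj, e[a[j]]), 0.0)
        total += p
    return eps / (l + 1) ** m * total

def p_f_given_e(f, e, eps=1.0):
    """Equivalent factored form: product over j of sums over i."""
    e = ["NULL"] + e
    l, m = len(e) - 1, len(f)
    p = eps / (l + 1) ** m
    for fj in f:
        p *= sum(T.get((fj, ei), 0.0) for ei in e)
    return p

f, e = ["le", "chien"], ["the", "dog"]
assert abs(p_f_given_e_naive(f, e) - p_f_given_e(f, e)) < 1e-12
print(p_f_given_e(f, e))

Both functions agree; the factored form needs O(l·m) operations instead of enumerating all (l+1)^m alignments.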
An Introduction to Statistical Machine Translation
Yao-Sheng Chang
Dept. of CSIE, NCKU
Date: 2011.04.12

Introduction
• Translation involves many cultural aspects
  – We consider only the translation of individual sentences, and only acceptable sentences.
• Every sentence in one language is a possible translation of any sentence in the other
  – Assign to each pair (S, T) a probability Pr(T | S): the probability that a translator will produce T in the target language when presented with S in the source language.

Statistical Machine Translation (SMT)
• Translation is cast as a noisy-channel problem

Fundamentals of SMT
• Given a string of French f, the job of our translation system is to find the string e that the speaker had in mind when he produced f. By Bayes' theorem,
  Pr(e | f) = Pr(e)·Pr(f | e) / Pr(f)
• Since the denominator Pr(f) is a constant for a given f, the best e is the one with the greatest value of Pr(e)·Pr(f | e).

Practical Challenges
• Computation of the translation model Pr(f | e)
• Computation of the language model Pr(e)
• Decoding (i.e., searching for the e that maximizes Pr(f | e)·Pr(e))

Alignment, Case 1 (one-to-many)
(figure: one English word aligned with several French words)

Alignment, Case 2 (many-to-one)
(figure: several English words aligned with one French word)

Alignment, Case 3 (many-to-many)
(figure: a group of English words aligned with a group of French words)

Formulation of Alignment
• Let e = e_1^l = e1 e2 ... el and f = f_1^m = f1 f2 ... fm
• An alignment between a pair of strings e and f is a mapping that connects each word fj to some word ei: if aj = i, then fj is connected to e_{aj} = ei
• In other words, an alignment a between e and f tells us that the word fj, 1 ≤ j ≤ m, is generated by the word e_{aj}, with aj ∈ {0, 1, ..., l}
• There are (l+1)^m different alignments between e and f (including connections to the empty word e0 = NULL, i.e., no mapping)

Translation Model
• The alignment a can be represented by a series a_1^m = a1 a2 ... am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then aj = i; if it is not connected to any English word, then aj = 0 (NULL).

IBM Model 1 (1)
• Model 1 assumes all alignments are equally likely, which gives
  Pr(f, a | e) = ε / (l+1)^m · ∏_{j=1}^{m} t(fj | e_{aj})
  Pr(f | e) = ε / (l+1)^m · Σ_{a1=0}^{l} ... Σ_{am=0}^{l} ∏_{j=1}^{m} t(fj | e_{aj})
  where t(fj | e_{aj}) is the word-translation probability.

IBM Model 1 (2)
• The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l.

Constrained Maximization
• We wish to adjust the translation probabilities so as to maximize Pr(f | e) subject to the constraint that, for each e,
  Σ_f t(f | e) = 1

Lagrange Multipliers (1)
• Method of Lagrange multipliers, with one constraint:
  – If f(x, y) has a maximum or minimum subject to the constraint g(x, y) = 0, then it occurs at one of the critical points of the function F defined by
    F(x, y, λ) = f(x, y) − λ·g(x, y)
  – f(x, y) is called the objective function
  – g(x, y) = 0 is called the constraint equation
  – F(x, y, λ) is called the Lagrange function
  – λ is called the Lagrange multiplier

Lagrange Multipliers (2)
• Following standard practice for constrained maximization, we introduce Lagrange multipliers λe and seek an unconstrained extremum of the auxiliary function
  h(t, λ) = Pr(f | e) − Σ_e λe (Σ_f t(f | e) − 1)

Derivation (1)
• The partial derivative of h with respect to t(f | e) is
  ∂h/∂t(f | e) = ε/(l+1)^m · Σ_{a1=0}^{l} ... Σ_{am=0}^{l} Σ_{j=1}^{m} δ(f, fj)·δ(e, e_{aj}) ∏_{k≠j} t(fk | e_{ak}) − λe
  where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise.

Derivation (2)
• Setting this partial derivative to zero gives
  t(f | e) = λe^{−1} Σ_a Pr(f, a | e) Σ_{j=1}^{m} δ(f, fj)·δ(e, e_{aj})   (11)
• We call the expected number of times that e connects to f in the translation (f | e) the count of f given e, and denote it by c(f | e; f, e). By definition,
  c(f | e; f, e) = Σ_a Pr(a | f, e) Σ_{j=1}^{m} δ(f, fj)·δ(e, e_{aj})   (12)

Derivation (3)
• Replacing λe by λe·Pr(f | e), Equation (11) can be written very compactly as
  t(f | e) = λe^{−1} c(f | e; f, e)
• In practice, our training data consists of a set of translations (f^(1) | e^(1)), (f^(2) | e^(2)), ..., (f^(S) | e^(S)), so this equation becomes
  t(f | e) = λe^{−1} Σ_{s=1}^{S} c(f | e; f^(s), e^(s))

Derivation (4)
• The sum over alignments can be rearranged into an expression that can be evaluated efficiently:
  Pr(f | e) = ε/(l+1)^m · ∏_{j=1}^{m} Σ_{i=0}^{l} t(fj | ei)
  c(f | e; f, e) = t(f | e) / (t(f | e0) + t(f | e1) + ... + t(f | el)) · Σ_{j=1}^{m} δ(f, fj) · Σ_{i=0}^{l} δ(e, ei)

Derivation (5)
• Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to (l+1)^m, as a direct evaluation of Equation (12) would require.
  (an EM training sketch built on these counts follows below)
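A minimal sketch of the resulting EM procedure for Model 1, built on the efficient count formula above. The toy corpus is hypothetical, and t(f | e) is initialized uniformly.

from collections import defaultdict

# Hypothetical toy corpus of (French, English) sentence pairs.
corpus = [(["le", "chien"], ["the", "dog"]),
          (["le", "chat"], ["the", "cat"])]

# Collect vocabularies; e0 = NULL is prepended to every English sentence.
f_vocab = {f for fs, _ in corpus for f in fs}
e_vocab = {e for _, es in corpus for e in es} | {"NULL"}

# Uniform initialization of t(f | e).
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for iteration in range(10):
    counts = defaultdict(float)   # c(f | e) accumulated over the corpus
    totals = defaultdict(float)   # sum_f c(f | e), for renormalization
    for fs, es in corpus:
        es = ["NULL"] + es
        for f in fs:
            # Denominator t(f|e0) + ... + t(f|el): O(l + m) work per pair.
            z = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / z    # expected count of the link (f, e)
                counts[(f, e)] += c
                totals[e] += c
    # M-step: t(f | e) = c(f | e) / sum over f' of c(f' | e)
    t = {(f, e): counts[(f, e)] / totals[e] if totals[e] else 0.0
         for f in f_vocab for e in e_vocab}

print(round(t[("chien", "dog")], 3))  # converges toward 1.0

Because "le" co-occurs with every English word but "chien" only with "dog", the expected counts increasingly credit "dog" with generating "chien", illustrating how EM resolves the ambiguity without word-aligned supervision.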
Other References about MT

Chinese-English Sentence Alignment
• 林語君 and 高照明, "A Hybrid Chinese-English Sentence Alignment Algorithm Combining Statistical and Linguistic Information," ROCLING XVI, 2004
• Problem: sentence boundaries in Chinese are not clear-cut; both the period and the comma can mark the end of a sentence.
• Method
  – Sentence-group alignment module
    • Two-stage iterative dynamic programming
  – Sentence-alignment features
    • A broad-coverage dictionary of translation words, plus a stop list
    • Partial matching of Chinese translation words
    • Similarity of the sequences of major punctuation marks within sentences
    • Shared numbers, time expressions, and source-text words used as alignment anchors
  – Group-alignment features
    • Dissimilarity of sentence lengths
    • Sentence-final punctuation marks are required to match

Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses
(Chien-Cheng Wu & Jason S. Chang, ROCLING 2003)
• Preprocessing steps calculate the following information:
  – Lists of preferred POS patterns of collocations in both languages
  – Collocation candidates matching the preferred POS patterns
  – N-gram statistics for both languages, N = 1, 2
  – Log-likelihood-ratio statistics for two consecutive words in both languages
  – Log-likelihood-ratio statistics for a pair of bilingual collocation candidates across the two languages
  – Content-word alignment based on the Competitive Linking Algorithm (Melamed 1997)

A Probability Model to Improve Word Alignment
Colin Cherry & Dekang Lin, University of Alberta

Outline
• Introduction
• Probabilistic word-alignment model
• Word-alignment algorithm
  – Constraints
  – Features

Introduction
• Word-aligned corpora are an excellent source of translation-related knowledge in statistical machine translation
  – E.g., translation lexicons, transfer rules
• The word-alignment problem
  – Conventional approaches usually use co-occurrence models
    • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
  – Indirect-association problem: Melamed (2000) proposed competitive linking, together with an explicit noise model, to solve it, scoring each word pair by
    score(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]
    where B(k | n, p) is the binomial distribution, λ+ is the probability that a co-occurring pair is truly linked, and λ− is the corresponding noise probability
    • Translation-pair example: CISCO System Inc. ↔ 思科系統
• Goal: propose a probabilistic word-alignment model that allows easy integration of context-specific features
  (a sketch of competitive linking follows below)
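A minimal sketch of Melamed-style competitive linking: greedily accept the highest-scoring word pair, then remove both words from further competition, which enforces one-to-one links. The score table is hypothetical; a real system would derive the scores from the noise model above.

# Hypothetical association scores between English and Chinese words.
scores = {("cisco", "思科"): 9.1, ("system", "系統"): 7.4,
          ("cisco", "系統"): 3.2, ("system", "思科"): 2.8,
          ("inc.", "公司"): 5.0}

def competitive_linking(scores):
    """Greedy one-to-one linking by descending association score."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if e not in used_e and f not in used_f:
            links.append((e, f, s))
            used_e.add(e)
            used_f.add(f)
    return links

print(competitive_linking(scores))
# [('cisco', '思科', 9.1), ('system', '系統', 7.4), ('inc.', '公司', 5.0)]

Note how the weaker indirect associations ("cisco", "系統") and ("system", "思科") are blocked once the stronger direct links have consumed their words.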
Probabilistic Word-Alignment Model
• Given E = e1, e2, ..., em and F = f1, f2, ..., fn
  – If ei and fj are a translation pair, then the link l(ei, fj) exists
  – If ei has no corresponding translation, then the null link l(ei, f0) exists
  – If fj has no corresponding translation, then the null link l(e0, fj) exists
  – An alignment A is a set of links such that every word in E and F participates in at least one link
• The alignment problem is to find the alignment A that maximizes P(A | E, F)
  – In contrast, IBM's translation models maximize P(A, F | E)

Probabilistic Word-Alignment Model (Cont.)
• Given A = {l1, l2, ..., lt}, where lk = l(e_ik, f_jk), and writing l_i^j for the consecutive subset {li, li+1, ..., lj} of A:
  P(A | E, F) = P(l_1^t | E, F) = ∏_{k=1}^{t} P(lk | E, F, l_1^{k−1})
• Let Ck = {E, F, l_1^{k−1}} represent the context of lk; then
  P(lk | Ck) = P(lk, Ck) / P(Ck) = P(Ck | lk)·P(lk) / P(Ck)
• Since e_ik and f_jk are always present in their own link and context, P(e_ik, f_jk | Ck) = 1 and P(e_ik, f_jk | lk) = 1; multiplying and dividing by P(e_ik, f_jk) then yields
  P(lk | Ck) = P(lk | e_ik, f_jk) · P(Ck | lk) / P(Ck | e_ik, f_jk)

Probabilistic Word-Alignment Model (Cont.)
• Ck = {E, F, l_1^{k−1}} is too complex to estimate
• FTk is a set of context-related features such that P(lk | Ck) can be approximated by P(lk | e_ik, f_jk, FTk)
• Let Ck′ = {e_ik, f_jk} ∪ FTk; assuming the features in FTk are conditionally independent,
  P(lk | Ck′) = P(lk | e_ik, f_jk) · P(Ck′ | lk) / P(Ck′ | e_ik, f_jk)
             ≈ P(lk | e_ik, f_jk) · ∏_{ft ∈ FTk} P(ft | lk) / P(ft | e_ik, f_jk)
  and therefore
  P(A | E, F) ≈ ∏_{k=1}^{t} [ P(lk | e_ik, f_jk) · ∏_{ft ∈ FTk} P(ft | lk) / P(ft | e_ik, f_jk) ]

An Illustrative Example
(figure: active features for candidate links; e.g., fta(−1, 1, host) and ftd(−1, det) are active for the pair (the1, l′), and ftd(−3, det) for the pair (the1, les))

Word-Alignment Algorithm
• Input: E, F, TE
  – TE is E's dependency tree, which enables features and constraints based on linguistic intuitions
• Constraints
  – One-to-one constraint: every word participates in exactly one link
  – Cohesion constraint: use TE to induce TF with no crossing dependencies

Word-Alignment Algorithm (Cont.)
• Features
  – Adjacency features fta: for any word pair (ei, fj), if a link l(ei′, fj′) exists where −2 ≤ i′−i ≤ 2 and −2 ≤ j′−j ≤ 2, then fta(i−i′, j−j′, ei′) is active for this context
  – Dependency features ftd: for any word pair (ei, fj), let ei′ be the governor of ei, and let rel be the grammatical relationship between them; if a link l(ei′, fj′) exists, then ftd(j−j′, rel) is active for this context
  (see the feature-scoring sketch at the end of this section)

Experimental Results
• Test bed: the Hansard corpus
  – Training: 50K aligned pairs of sentences (Och & Ney 2000)
  – Testing: 500 pairs

Future Work
• The alignment algorithm presented here is incapable of creating alignments that are not one-to-one; many-to-one alignments will be pursued.
• The proposed model itself can represent many-to-one alignments, with the extra words on the "many" side accounted for through their null-link probabilities.
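To close the section, a minimal sketch of the feature-weighted link score from the model above, computed in log space. All probability tables are hypothetical placeholders for estimates that would be learned from word-aligned training data.

import math

# Hypothetical learned estimates; placeholders for trained values.
P_LINK = {("the", "les"): 0.4}            # P(l_k | e_ik, f_jk)
P_FT_GIVEN_LINK = {"ftd(-1,det)": 0.6}    # P(ft | l_k)
P_FT_GIVEN_PAIR = {"ftd(-1,det)": 0.2}    # P(ft | e_ik, f_jk)

def log_link_score(e, f, active_features):
    """log of P(l | e, f) * prod over ft of P(ft | l) / P(ft | e, f)."""
    score = math.log(P_LINK.get((e, f), 1e-6))
    for ft in active_features:
        num = P_FT_GIVEN_LINK.get(ft, 1e-6)
        den = P_FT_GIVEN_PAIR.get(ft, 1e-6)
        score += math.log(num) - math.log(den)
    return score

# The active dependency feature raises this link's score by log(0.6 / 0.2).
print(log_link_score("the", "les", ["ftd(-1,det)"]))

Each active feature contributes log P(ft | lk) − log P(ft | e_ik, f_jk), so features that are more likely under a true link than under a mere word pair push the candidate link's score up.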