Statistical Alignment and Machine Translation

Lecture 8: Statistical Alignment & Machine Translation
(Chapter 13 of Manning & Schütze)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2014/05/12
(Slides from Berlin Chen, National Taiwan Normal University
http://140.122.185.120/PastCourses/2004SNaturalLanguageProcessing/NLP_main_2004S.htm)
References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.
Machine Translation (MT)
• Definition
  – Automatic translation of text or speech from one language to another
• Goal
  – Produce close-to-error-free output that reads fluently in the target language
  – Current systems are still far from this goal
• Current Status
  – Existing systems still fall short in output quality
    • Babel Fish Translation (before 2004)
    • Google Translation
  – A mix of probabilistic and non-probabilistic components
Issues
• Build high-quality semantics-based MT systems in circumscribed domains
• Or abandon fully automatic MT and build software to assist human translators instead
  – E.g., humans post-edit the output of an imperfect translation system
• Develop automatic knowledge-acquisition techniques for improving general-purpose MT
  – Supervised or unsupervised learning
Different Strategies for MT
(Figure: the classic MT pyramid. On each side, text is analyzed at increasingly abstract levels: English Text (word string) → English (syntactic parse) → English (semantic representation) → Interlingua (knowledge representation), and correspondingly down to French Text (word string) on the other side. Translation can be performed at any level: word-for-word at the string level, syntactic transfer at the parse level, semantic transfer at the semantic level, or knowledge-based translation through the interlingua.)
Word for Word MT
• Translate words one by one from one language to another
  – Problems
    1. No one-to-one correspondence between words in different languages (lexical ambiguity)
       – Need to look at a context larger than the individual word (→ phrase or clause)
       – Example: English "suit" maps to different French words depending on whether it means a lawsuit (訴訟) or a set of garments (服裝)
    2. Languages have different word orders
Syntactic Transfer MT
• Parse the source text, transfer the parse tree of the source text into a syntactic tree in the target language, and then generate the translation from this syntactic tree
  – Solves the problem of word ordering
  – Problems
    • Syntactic ambiguity
    • The target syntax will likely mirror that of the source text
      – Example: German "Ich esse gern" (N V Adv) means "I like to eat", but direct syntactic transfer yields the unnatural English "I eat readily/gladly"
Text Alignment: Definition
• Definition
  – Align paragraphs, sentences, or words in one language with paragraphs, sentences, or words in another language
    • Thus we can learn which words tend to be translated by which other words in the other language
      → bilingual dictionaries, MT, parallel grammars, …
  – Not part of the MT process per se
    • But the obligatory first step for making use of multilingual text corpora
• Applications
  – Bilingual lexicography
  – Machine translation
  – Multilingual information retrieval
  – …
Text Alignment: Sources and Granularities
• Sources of parallel texts or bitexts
  – Parliamentary proceedings (Hansards)
  – Newspapers and magazines
  – Religious and literary works (often with less literal translations)
• Two levels of alignment
  – Gross large-scale alignment
    • Learn which paragraphs or sentences correspond to which paragraphs or sentences in the other language
  – Word alignment
    • Learn which words tend to be translated by which words in the other language
    • The necessary step for acquiring a bilingual dictionary
  Note: the order of words or sentences might not be preserved.
Text Alignment: Example 1
(Figure: a 2:2 sentence alignment.)

Text Alignment: Example 2
(Figure: a passage aligned as a sequence of beads, i.e., sentence-alignment units: a 2:2 alignment, two 1:1 alignments, and a 2:1 alignment.)
Studies show that around 90% of alignments are 1:1 sentence alignments.
Sentence Alignment
• Crossing dependencies are not allowed here
  – Word order is preserved!
• Related work (overview table omitted)
Sentence Alignment
• Length-based
• Lexically guided
• Offset-based
Sentence Alignment
Length-based method
• Rationale: short sentences tend to be translated as short sentences, and long sentences as long sentences
  – Length is defined as the number of words or the number of characters
• Approach 1 (Gale & Church 1993)
  – Corpus: Union Bank of Switzerland (UBS) corpus: English, French, and German
  – Assumptions
    • The paragraph structure is clearly marked in the corpus; confusions are checked by hand
    • Lengths of sentences are measured in characters
    • Crossing dependencies are not handled: the order of sentences is not changed in the translation
  – Drawback: ignores the rich lexical information available in the text
Sentence Alignment
Length-based method
The source text S = s1 s2 … sI and the target text T = t1 t2 … tJ are divided into a sequence of aligned groups, or beads, B1, B2, …, BK. Possible bead types: {1:1, 1:0, 0:1, 2:1, 1:2, 2:2, …}.

Assuming probabilistic independence between beads, the best alignment is

  arg max_A P(A | S, T) = arg max_A P(A, S, T) ≈ arg max_A ∏_{k=1}^{K} P(Bk)

where A = (B1, B2, …, BK).
Sentence Alignment
Length-based method
– Dynamic programming
  • The sentence is the unit of alignment
  • Character lengths are modeled statistically
  • The cost function (distance measure) for one bead, derived via Bayes' law:

    cost(align(l1, l2)) = −log P(align | δ(l1, l2, μ, s²))
                        = −log [ P(align) · P(δ(l1, l2, μ, s²) | align) ]
                        = −log P(Bk)

    δ(l1, l2, μ, s²) = (l2 − l1·μ) / sqrt(l1·s²)

    where l1 and l2 are the character lengths of the source and target portions of the bead, μ is the mean ratio of target-text length to source-text length (L2/L1), and s² is its variance, so that δ is approximately standard normal for a true alignment
  • P(δ | align) = 2·(1 − Φ(|δ|)), where Φ is the cumulative distribution function of the standard normal distribution
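To make the cost function concrete, here is a minimal Python sketch of the bead cost above. The prior bead probabilities and the parameters C and S2 are assumptions, set roughly to the values reported by Gale and Church (1993).

    import math

    # Assumed parameters, roughly as reported by Gale & Church (1993).
    C = 1.0     # expected target characters per source character (the mean ratio mu)
    S2 = 6.8    # variance s^2 of the length ratio
    PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
             (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

    def norm_cdf(z):
        # Cumulative distribution function Phi of the standard normal.
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    def bead_cost(l1, l2, match):
        # cost = -log P(match) - log P(delta | align), as in the formulas above.
        delta = (l2 - l1 * C) / math.sqrt(max(l1, 1) * S2)
        p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-12)  # floor avoids log(0)
        return -math.log(PRIOR[match]) - math.log(p_delta)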
Sentence Alignment
Length-based method
• The prior probability P(align) (or P(α align)) of each bead type is combined with the length cost; the best alignment is then computed by dynamic programming over source positions i and target positions j:

  D(i, j) = min {
      D(i, j−1)   + cost(0:1 align(∅, tj)),
      D(i−1, j)   + cost(1:0 align(si, ∅)),
      D(i−1, j−1) + cost(1:1 align(si, tj)),
      D(i−1, j−2) + cost(1:2 align(si, tj−1, tj)),
      D(i−2, j−1) + cost(2:1 align(si−1, si, tj)),
      D(i−2, j−2) + cost(2:2 align(si−1, si, tj−1, tj))
  }
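A sketch of the dynamic program itself, reusing bead_cost and PRIOR from the previous sketch; backpointers for recovering the actual alignment are omitted for brevity.

    def align_cost(src_lens, tgt_lens):
        # src_lens / tgt_lens: character lengths of the source / target sentences.
        I, J = len(src_lens), len(tgt_lens)
        INF = float("inf")
        D = [[INF] * (J + 1) for _ in range(I + 1)]
        D[0][0] = 0.0
        for i in range(I + 1):
            for j in range(J + 1):
                if i == 0 and j == 0:
                    continue
                best = INF
                for di, dj in PRIOR:          # bead types 1:1, 1:0, 0:1, 2:1, 1:2, 2:2
                    pi, pj = i - di, j - dj
                    if pi < 0 or pj < 0 or D[pi][pj] == INF:
                        continue
                    l1 = sum(src_lens[pi:i])  # characters in the di source sentences
                    l2 = sum(tgt_lens[pj:j])  # characters in the dj target sentences
                    best = min(best, D[pi][pj] + bead_cost(l1, l2, (di, dj)))
                D[i][j] = best
        return D[I][J]

    # Example: align_cost([len(s) for s in src_sents], [len(t) for t in tgt_sents])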
Sentence Alignment
Length-based method
– A simple example: two candidate alignments of source sentences s1–s4 with target sentences t1–t3
  • Alignment 1: cost(align(s1, s2, t1)) + cost(align(s3, t2)) + cost(align(s4, t3))
  • Alignment 2: cost(align(s1, t1)) + cost(align(s2, t2)) + cost(align(s3, ∅)) + cost(align(s4, t3))
  The dynamic program chooses whichever total cost is lower.
Sentence Alignment
Length-based method
– A 4% error rate was achieved
– Problems:
  • Cannot handle noisy and imperfect input
    – E.g., OCR output or files containing unknown markup conventions
    – Finding paragraph or sentence boundaries is difficult
    – Solution: just align text (position) offsets in the two parallel texts (Church 1993)
  • Questionable for languages with few cognates (同源) or different writing systems
    – E.g., English ←→ Chinese, Eastern European languages ←→ Asian languages
Sentence Alignment
Lexical method
• Approach 1 (Kay and Röscheisen 1993)
  – First assume the first and last sentences of the texts are aligned; these serve as the initial anchors
  – Form an envelope of possible alignments
    • Alignments are excluded when sentences cross anchors or when their respective distances from an anchor differ greatly
  – Choose word pairs whose distributions are similar across most of the sentences
  – Find pairs of source and target sentences that contain many possible lexical correspondences
    • The most reliable pairs are used to induce a set of partial alignments (added to the list of anchors)
  – Iterate
Word Alignment
• The sentence/offset alignment can be extended to a word alignment
• Selected aligned word pairs are then included in the bilingual dictionary, according to criteria such as
  – Frequency of word correspondences
  – Association measures (e.g., the Dice coefficient; see the sketch below)
  – …
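As a concrete illustration of the association-measure step, a minimal sketch that ranks candidate word pairs from sentence-aligned bitext by the Dice coefficient; a real system would add frequency cutoffs and stronger measures such as the log-likelihood ratio.

    from collections import Counter

    def dice_ranking(aligned_pairs):
        # aligned_pairs: list of (source_tokens, target_tokens) sentence pairs.
        cooc, s_cnt, t_cnt = Counter(), Counter(), Counter()
        for src_sent, tgt_sent in aligned_pairs:
            src, tgt = set(src_sent), set(tgt_sent)
            s_cnt.update(src)
            t_cnt.update(tgt)
            for s in src:
                for t in tgt:
                    cooc[(s, t)] += 1
        # Dice(s, t) = 2 * cooc(s, t) / (count(s) + count(t))
        scored = [(2.0 * c / (s_cnt[s] + t_cnt[t]), s, t)
                  for (s, t), c in cooc.items()]
        return sorted(scored, reverse=True)

    # Example: dice_ranking([(["the", "house"], ["la", "maison"]),
    #                        (["the", "cat"], ["le", "chat"])])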
Statistical Machine Translation
• The noisy channel model

  e → [ Translation Model P(f | e) ] → f → [ Decoder: ê = arg max_e P(e | f) ] → ê

  with a Language Model P(e) over the source of the channel; e: English, f: French

  (Figure: each French word fj, j = 1 … m = |f|, is aligned with an English word e_{a_j}, where a_j indexes into the English sentence of length l = |e|.)

– Assumptions:
  • An English word can be aligned with multiple French words, while each French word is aligned with at most one English word
  • The individual word-to-word translations are independent
Statistical Machine Translation
• Three important components are involved
  – Decoder

    ê = arg max_e P(e | f) = arg max_e P(e) · P(f | e) / P(f) = arg max_e P(e) · P(f | e)

  – Language model
    • Gives the probability P(e)
  – Translation model, which sums over all possible alignments:

    P(f | e) = (1/Z) Σ_{a1=0}^{l} … Σ_{am=0}^{l} ∏_{j=1}^{m} P(fj | e_{a_j})

    where Z is a normalization constant, a_j indexes the English word that the French word fj is aligned with, and P(fj | e_{a_j}) is a word-to-word translation probability
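As one concrete instance of the translation model, a sketch in IBM Model 1 form (introduced later in these slides), where the constant ε/(l+1)^m plays the role of 1/Z and the sum over alignments factorizes into a product of sums. The dictionary t of word-translation probabilities is an assumed input.

    def model1_prob(f_sent, e_sent, t, epsilon=1.0):
        # P(f|e) under IBM Model 1, using the identity
        #   sum over alignments of prod_j t(f_j|e_aj) = prod_j sum_i t(f_j|e_i),
        # so the cost is O(l*m) rather than O((l+1)^m).
        es = ["NULL"] + e_sent                   # e_0 is the null word
        p = epsilon / (len(es) ** len(f_sent))   # epsilon / (l+1)^m
        for f in f_sent:
            p *= sum(t.get((f, e), 0.0) for e in es)
        return p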
An Introduction to Statistical Machine
Translation
Dept. of CSIE, NCKU
Yao-Sheng Chang
Date: 2011.04.12
Introduction
• Translation involves many cultural aspects
  – We only consider the translation of individual sentences, and only acceptable sentences
• Every sentence in one language is a possible translation of any sentence in the other
  – Assign to each pair (S, T) a probability Pr(T|S): the probability that a translator will produce T in the target language when presented with S in the source language
Statistical Machine Translation (SMT)
• MT cast as a noisy-channel problem
Fundamental of SMT
• Given a French string f, the job of our translation system is to find the string e that the speaker had in mind when producing f. By Bayes' theorem:

  Pr(e | f) = Pr(e) · Pr(f | e) / Pr(f)

• Since the denominator Pr(f) is a constant for a given f, the best e is the one with the greatest value of Pr(e) · Pr(f | e).
Practical Challenges
• Computation of the translation model Pr(f|e)
• Computation of the language model Pr(e)
• Decoding (i.e., searching for the e that maximizes Pr(f|e) · Pr(e))
Alignment of case 1 (one-to-many)
Alignment of case 2 (many-to-one)
Alignment of case 3 (many-to-many)
Formulation of Alignment(1)
• Let e = e_1^l = e1 e2 … el and f = f_1^m = f1 f2 … fm
• An alignment between the pair of strings (e, f) is a mapping that connects every word fj to some word ei:

  e = e1 e2 … ei … el
      aj = i  (fj is generated by e_{a_j} = ei)
  f = f1 f2 … fj … fm

• In other words, an alignment a between e and f says that the word fj, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with aj ∈ {0, 1, …, l}
• There are (l+1)^m different alignments between e and f (including the null word, for French words with no English counterpart)
Translation Model
• The alignment, a, can be represented by a series a_1^m = a1 a2 … am of m values, each between 0 and l, such that if the word in position j of the French string is connected to the word in position i of the English string, then aj = i, and if it is not connected to any English word, then aj = 0 (the null word).
IBM Model I (1)
• IBM Model 1 assumes all alignments are equally likely, giving

  Pr(f, a | e) = ε / (l+1)^m · ∏_{j=1}^{m} t(fj | e_{a_j})

  and, summing over all alignments,

  Pr(f | e) = ε / (l+1)^m · Σ_{a1=0}^{l} … Σ_{am=0}^{l} ∏_{j=1}^{m} t(fj | e_{a_j})
IBM Model I (2)
• The alignment is determined by specifying the values of aj for j from 1 to m, each of which can take any value from 0 to l.
Constrained Maximization
• We wish to adjust the translation probabilities so as to maximize Pr(f|e) subject to the constraint that, for each e,

  Σ_f t(f | e) = 1
Lagrange Multipliers (1)
• Method of Lagrange multipliers (拉格朗日乘數法), the case of one constraint:
  – If there is a maximum or minimum subject to the constraint g(x, y) = 0, then it will occur at one of the critical points of the function F defined by

    F(x, y, λ) = f(x, y) − λ·g(x, y)

  – f(x, y) is called the objective function (目標函數)
  – g(x, y) = 0 is called the constraint equation (條件限制方程式)
  – F(x, y, λ) is called the Lagrange function (拉格朗日函數)
  – λ is called the Lagrange multiplier (拉格朗日乘數)
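A tiny worked example of the technique (not from the slides): maximize f(x, y) = xy subject to x + y = 1.

  F(x, y, λ) = xy − λ(x + y − 1)
  ∂F/∂x = y − λ = 0,  ∂F/∂y = x − λ = 0,  ∂F/∂λ = −(x + y − 1) = 0
  ⇒ x = y = λ = 1/2, giving the constrained maximum f(1/2, 1/2) = 1/4.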
Lagrange Multipliers (2)
• Following standard practice for constrained maximization, we introduce Lagrange multipliers λe, and seek an unconstrained extremum of the auxiliary function

  h(t, λ) = Pr(f | e) − Σ_e λe · ( Σ_f t(f | e) − 1 )
Derivation (1)
• The partial derivative of h with respect to t(f|e) is

  ∂h/∂t(f|e) = Σ_{a1=0}^{l} … Σ_{am=0}^{l} ε/(l+1)^m Σ_{j=1}^{m} δ(f, fj) δ(e, e_{a_j}) ∏_{k≠j} t(fk | e_{a_k}) − λe

  where δ is the Kronecker delta function, equal to one when both of its arguments are the same and equal to zero otherwise
Derivation (2)
• We call the expected number of times that e connects to f in the translation (f|e) the count of f given e for (f|e), and denote it by c(f|e; f, e). By definition,

  c(f|e; f, e) = Σ_a Pr(a | e, f) Σ_{j=1}^{m} δ(f, fj) δ(e, e_{a_j})
Derivation (3)
• Replacing λe by λe·Pr(f|e), Equation (11) can be written very compactly as

  t(f | e) = λe⁻¹ · c(f|e; f, e)

• In practice, our training data consists of a set of translations (f⁽¹⁾|e⁽¹⁾), (f⁽²⁾|e⁽²⁾), …, (f⁽ˢ⁾|e⁽ˢ⁾), so this equation becomes

  t(f | e) = λe⁻¹ · Σ_{s=1}^{S} c(f|e; f⁽ˢ⁾, e⁽ˢ⁾)
Derivation (4)
• The sum over alignments factorizes, giving an expression that can be evaluated efficiently:

  Pr(f | e) = ε/(l+1)^m Σ_{a1=0}^{l} … Σ_{am=0}^{l} ∏_{j=1}^{m} t(fj | e_{a_j}) = ε/(l+1)^m ∏_{j=1}^{m} Σ_{i=0}^{l} t(fj | ei)
Derivation (5)
• Thus, the number of operations necessary to calculate a count is proportional to l + m rather than to (l+1)^m as in Equation (12).
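Putting the derivation together, a minimal Python sketch of the resulting EM re-estimation for IBM Model 1 (an illustration of the equations above, not code from the slides); bitext is assumed to be a list of (f_sentence, e_sentence) token-list pairs.

    from collections import defaultdict

    def train_model1(bitext, n_iter=10):
        # Initialize t(f|e) uniformly over co-occurring pairs; Model 1's
        # likelihood has a single maximum, so any positive start works.
        t = defaultdict(float)
        for f_sent, e_sent in bitext:
            for f in f_sent:
                for e in ["NULL"] + e_sent:
                    t[(f, e)] = 1.0
        for _ in range(n_iter):
            count = defaultdict(float)   # c(f|e), summed over the corpus
            total = defaultdict(float)   # sum_f c(f|e), the normalizer (lambda_e)
            for f_sent, e_sent in bitext:
                es = ["NULL"] + e_sent
                for f in f_sent:
                    z = sum(t[(f, e)] for e in es)   # sum_i t(f|e_i), the efficient form
                    for e in es:
                        c = t[(f, e)] / z            # expected count of the link (f, e)
                        count[(f, e)] += c
                        total[e] += c
            for (f, e) in count:
                t[(f, e)] = count[(f, e)] / total[e] # t(f|e) = c(f|e) / lambda_e
        return t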
Other References about MT
Chinese-English Sentence Alignment
• 林語君 and 高照明, "A Hybrid Chinese-English Bilingual Sentence Alignment Algorithm Combining Statistical and Linguistic Information," ROCLING XVI, 2004
• Problem: sentence boundaries in Chinese are unclear; both the period and the comma can mark the end of a sentence
• Method
  – Sentence-group alignment module
    • Two-stage iterative DP
  – Sentence-alignment parameters
    • A broad-coverage dictionary of translation words plus a stop list
    • Partial matching of Chinese translation words
    • Similarity of the sequences of important punctuation marks within sentences
    • Alignment anchors from shared numeric expressions, time expressions, and source-text words left untranslated
  – Group-alignment parameters
    • Sentence-length dissimilarity
    • Sentence-final punctuation is required to match
Bilingual Collocation Extraction Based on
Syntactic and Statistical Analyses
(Chien-Cheng Wu & Jason S. Chang, ROCLING’03)
• Preprocessing steps calculate the following information:
  – Lists of preferred POS patterns of collocations in both languages
  – Collocation candidates matching the preferred POS patterns
  – N-gram statistics for both languages, N = 1, 2
  – Log-likelihood ratio statistics for two consecutive words in both languages
  – Log-likelihood ratio statistics for pairs of bilingual collocation candidates across the two languages
  – Content-word alignment based on the Competitive Linking Algorithm (Melamed 1997)
A Probability Model to Improve
Word Alignment
Colin Cherry & Dekang Lin,
University of Alberta
Outline
• Introduction
• Probabilistic Word-Alignment Model
• Word-Alignment Algorithm
– Constraints
– Features
Introduction
• Word-aligned corpora are an excellent source of
translation-related knowledge in statistical machine
translation.
– E.g., translation lexicons, transfer rules
• Word-alignment problem
  – Conventional approaches usually use co-occurrence models
    • E.g., φ² (Gale & Church 1991), log-likelihood ratio (Dunning 1993)
  – Indirect-association problem (e.g., "CISCO System Inc." ←→ 思科 系統): Melamed (2000) proposed competitive linking along with an explicit noise model to solve it:

    score_B(u, v) = log [ B(links(u, v) | cooc(u, v), λ+) / B(links(u, v) | cooc(u, v), λ−) ]

    where B is a binomial distribution, λ+ is the link probability for true translation pairs, and λ− is the link probability for noise
• Goal: propose a probabilistic word-alignment model that allows easy integration of context-specific features
Probabilistic Word-Alignment Model
• Given E = e1, e2, …, em and F = f1, f2, …, fn
  – If ei and fj are a translation pair, then the link l(ei, fj) exists
  – If ei has no corresponding translation, then the null link l(ei, f0) exists
  – If fj has no corresponding translation, then the null link l(e0, fj) exists
  – An alignment A is a set of links such that every word in E and F participates in at least one link
• The alignment problem is to find the alignment A that maximizes P(A | E, F)
• Compare IBM's translation model, which maximizes P(A, F | E)
Probabilistic Word-Alignment Model (Cont.)
• Given A = {l1, l2, …, lt}, where lk = l(e_{ik}, f_{jk}), and writing l_i^j = {li, li+1, …, lj} for consecutive subsets of A:

  P(A | E, F) = P(l_1^t | E, F) = ∏_{k=1}^{t} P(lk | E, F, l_1^{k−1})

• Let Ck = {E, F, l_1^{k−1}} represent the context of lk. Then

  P(lk | Ck) = P(lk, Ck) / P(Ck) = P(Ck | lk) · P(lk) / P(Ck)
             = P(lk | e_{ik}, f_{jk}) · P(Ck | lk) / P(Ck | e_{ik}, f_{jk})

  using P(Ck, e_{ik}, f_{jk}) = P(Ck), since P(e_{ik}, f_{jk} | Ck) = 1, and P(lk, e_{ik}, f_{jk}) = P(lk), since P(e_{ik}, f_{jk} | lk) = 1
Probabilistic Word-Alignment Model (Cont.)
• Ck = {E, F, l_1^{k−1}} is too complex to estimate
• FTk is a set of context-related features such that P(lk | Ck) can be approximated by P(lk | e_{ik}, f_{jk}, FTk)
• Let Ck' = {e_{ik}, f_{jk}} ∪ FTk. Then

  P(lk | Ck') = P(lk | e_{ik}, f_{jk}) · P(Ck' | lk) / P(Ck' | e_{ik}, f_{jk})
              = P(lk | e_{ik}, f_{jk}) · P(FTk | lk) / P(FTk | e_{ik}, f_{jk})

  and, assuming the features in FTk are conditionally independent,

  P(A | E, F) ≈ ∏_{k=1}^{t} [ P(lk | e_{ik}, f_{jk}) · ∏_{ft∈FTk} P(ft | lk) / P(ft | e_{ik}, f_{jk}) ]
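A minimal sketch of scoring an alignment under this factorization. The lookup tables p_link, p_ft_link, and p_ft_pair and the helper features are hypothetical stand-ins for the model's estimated probabilities.

    import math

    def alignment_log_prob(links, p_link, p_ft_link, p_ft_pair, features):
        # log P(A|E,F) = sum_k [ log P(l_k | e, f)
        #                        + sum_ft (log P(ft | l_k) - log P(ft | e, f)) ]
        lp = 0.0
        for e, f in links:                    # links: list of (e_word, f_word) pairs
            lp += math.log(p_link[(e, f)])
            for ft in features(e, f):         # the active feature set FT_k
                lp += math.log(p_ft_link[(ft, e, f)]) - math.log(p_ft_pair[(ft, e, f)])
        return lp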
An Illustrative Example
Word-Alignment Algorithm
• Input: E, F, TE
  – TE is E's dependency tree, which enables features and constraints based on linguistic intuitions
• Constraints
  – One-to-one constraint: every word participates in exactly one link
  – Cohesion constraint: use TE to induce TF with no crossing dependencies
Word-Alignment Algorithm (Cont.)
• Features
  – Adjacency features fta: for any word pair (ei, fj), if a link l(ei', fj') exists where −2 ≤ i'−i ≤ 2 and −2 ≤ j'−j ≤ 2, then fta(i−i', j−j', ei') is active for this context
  – Dependency features ftd: for any word pair (ei, fj), let ei' be the governor of ei, and let rel be the grammatical relationship between them; if a link l(ei', fj') exists, then ftd(j−j', rel) is active for this context
  – Example:
    pair(the1, l') → { fta(−1, 1, host), ftd(−1, det) }
    pair(the1, les) → ftd(−3, det)
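A small sketch of extracting the adjacency features just described (dependency features would additionally require the dependency tree); all names here are illustrative.

    def adjacency_features(i, j, links, e_words):
        # Active fta features for candidate pair (e_i, f_j): for every existing
        # link l(e_i', f_j') within a two-word window on both sides,
        # fta(i - i', j - j', e_i') fires.
        active = []
        for (i2, j2) in links:
            if abs(i2 - i) <= 2 and abs(j2 - j) <= 2:
                active.append(("fta", i - i2, j - j2, e_words[i2]))
        return active

    # Example: adjacency_features(1, 1, [(0, 0)], ["host", "the"])
    #          -> [("fta", 1, 1, "host")]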
Experimental Results
• Test bed: the Hansard corpus
  – Training: 50K aligned sentence pairs (Och & Ney 2000)
  – Testing: 500 pairs
Future Work
• The alignment algorithm presented here cannot create alignments that are not one-to-one; many-to-one alignments will be pursued
• The proposed model itself can express many-to-one alignments, through the null-link probabilities of the extra words on the "many" side