Statistical Machine Translation
Alona Fyshe
Based on slides from Colin Cherry and Dekang Lin
Basic statistics
• 0 <= P(x) <=1
• P(A)
 Probability that A happens
• P(A,B)
 Probability that A and B happen
• P(A|B)
 Probability that A happens given that we know
B happened
Basic statistics
• Conditional probability
P(A | B) = P(A,B) / P(B)
Basic Statistics
• Use definition of conditional probability to
derive the chain rule
P(A | B) = P(A,B) / P(B)
P(A,B) = P(B)P(A | B) = P(A)P(B | A)
P(A_1, A_2, …, A_n) = P(A_n | A_{n-1}, …, A_1) P(A_{n-1}, …, A_1)
                    = P(A_1)P(A_2 | A_1)P(A_3 | A_1, A_2) … P(A_n | A_1, …, A_{n-1})
Basic Statistics
• Bayes Rule
P(A,B) = P(A | B)P(B)
P(A,B) = P(B | A)P(A)
P(A | B)P(B) = P(B | A)P(A)
P(A | B) = P(B | A)P(A) / P(B)
Basic Statistics
• Just remember
 Definition of cond. prob.
P(A | B) = P(A,B) / P(B)
 Bayes rule
P(A | B) = P(B | A)P(A) / P(B)
 Chain rule
P(A_1, …, A_n) = P(A_1)P(A_2 | A_1)P(A_3 | A_1, A_2) … P(A_n | A_1, …, A_{n-1})
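These three identities are easy to sanity-check numerically. Below is a minimal sketch (not from the original slides) that verifies them on a small made-up joint distribution over two binary events; the joint table values are arbitrary.

```python
# Sanity check of the conditional-probability definition, Bayes rule, and
# the chain rule on an arbitrary toy joint distribution P(A, B).
joint = {  # P(A=a, B=b); values are made up but sum to 1
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.25, (1, 1): 0.35,
}

def p_a(a):                 # marginal P(A=a)
    return sum(v for (x, _), v in joint.items() if x == a)

def p_b(b):                 # marginal P(B=b)
    return sum(v for (_, y), v in joint.items() if y == b)

def p_a_given_b(a, b):      # definition: P(A|B) = P(A,B) / P(B)
    return joint[(a, b)] / p_b(b)

def p_b_given_a(b, a):
    return joint[(a, b)] / p_a(a)

for a in (0, 1):
    for b in (0, 1):
        # Chain rule: P(A,B) = P(B) P(A|B) = P(A) P(B|A)
        assert abs(joint[(a, b)] - p_b(b) * p_a_given_b(a, b)) < 1e-12
        assert abs(joint[(a, b)] - p_a(a) * p_b_given_a(b, a)) < 1e-12
        # Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
        assert abs(p_a_given_b(a, b) - p_b_given_a(b, a) * p_a(a) / p_b(b)) < 1e-12
print("All identities hold.")
```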
Goal
• Translate.
• I’ll use translating French (F) into English (E)
as the running example.
Oh, Canada
• I’m Canadian
 Mandatory French class in school until grade 6
 I speak “Cereal Box French”
Gratuit (free)
Gagner (win)
Chocolat (chocolate)
Glaçage (icing)
Sans gras (fat free)
Sans cholestérol (cholesterol free)
Élevé dans la fibre (high in fibre)
Oh, Canada
Machine Translation
• Translation is easy for (bilingual) people
• Process:
 Read the text in French
 Understand it
 Write it down in English
Machine Translation
• Understanding language and writing well-formed text
 Hard tasks for computers
 The human process is invisible, intangible
One approach: Babelfish
• A rule-based approach to
machine translation
• A 30-year-old feat in Software Eng.
• Programming the knowledge in by hand
is difficult and expensive
Alternate Approach: Statistics
• We are trying to model P(E|F)
 I give you a French sentence
 You give me back English
• How are we going to model this?
 We could use Bayes rule:
P(E | F) = P(F | E)P(E) / P(F) ∝ P(F | E)P(E)
Alternate Approach: Statistics
P(E | F) = P(F | E)P(E) / P(F) ∝ P(F | E)P(E)
Given a French sentence F, we could do a
search for an E that maximizes P(E | F)
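As a concrete sketch of that search: given candidate English sentences plus some functions for P(E) and P(F|E), the decoder just picks the candidate that maximizes their product. The function names `p_e` and `p_f_given_e` below are placeholders for the models developed later in the lecture, not anything from the original slides.

```python
# Noisy-channel decoding sketch: pick argmax_E P(F|E) * P(E) over a candidate set.
# `p_e` (language model) and `p_f_given_e` (translation model) are stand-ins
# for the n-gram model and word-translation model built later.
from typing import Callable, Iterable

def decode(f: str,
           candidates: Iterable[str],
           p_e: Callable[[str], float],
           p_f_given_e: Callable[[str, str], float]) -> str:
    """Return the English candidate maximizing P(F|E) * P(E) for French sentence f."""
    return max(candidates, key=lambda e: p_f_given_e(f, e) * p_e(e))

# usage: best = decode(french_sentence, candidate_english_sentences, p_e, p_f_given_e)
```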
Why Bayes rule at all?
• Why not model P(E|F) directly?
• P(F|E)P(E) decomposition allows us to be
sloppy
 P(E) worries about good English
 P(F|E) worries about French that matches English
 The two can be trained independently
Crime Scene Analogy
• F is a crime scene. E is a person who may have
committed the crime
 P(E|F) - look at the scene - who did it?
 P(E) - who had a motive? (Profiler)
 P(F|E) - could they have done it? (CSI: transportation, access to weapons, alibi)
• Some people might have great motives, but no
means - you need both!
On voit Jon à la télévision
Candidate translations, judged on two axes: good English? P(E), and good match to the French? P(F|E)
 Jon appeared in TV.
 It back twelve saw.
 In Jon appeared TV.
 Jon is happy today.
 Jon appeared on TV.
 TV appeared on Jon.
 Jon was not happy.
Only “Jon appeared on TV.” scores well on both.
Table borrowed from Jason Eisner
I speak English good.
• How are we going to model good English?
• How do we know these sentences are not good
English?
 Jon appeared in TV.
 It back twelve saw.
 In Jon appeared TV.
 TV appeared on Jon.
 Je ne parle pas l'anglais.
I speak English good.
• Je ne parle pas l'anglais. (French for “I don’t speak English.”)
 These aren’t English words.
• It back twelve saw.
 These are English words, but it’s gibberish.
• Jon appeared in TV.
 “appeared in TV” isn’t proper English
I speak English good.
• Let’s say we have a huge collection of
documents written in English
 Like, say, the Internet.
• It would be a pretty comprehensive list of
English words
 Save for “named entities”: people, places, things
 Might include some non-English words
 Speling mitsakes! lol!
• Could also tell if a phrase is good English
Google, is this good English?
• Jon appeared in TV.
 “Jon appeared” 1,800,000 Google results
 “appeared in TV” 45,000 Google results
 “appeared on TV” 210,000 Google results
• It back twelve saw.
 “twelve saw” 1,100 Google results
 “It back twelve” 586 Google results
 “back twelve saw” 0 Google results
• Imperfect counting… why?
Google, is this good English?
• Language is often modeled this way
 Collect statistics about the frequency of words
and phrases
 N-gram statistics
 1-gram = unigram
 2-gram = bigram
 3-gram = trigram
 4-gram = four-gram
 5-gram = five-gram
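A minimal sketch of collecting such n-gram statistics from raw text (not the Google data set itself; the corpus string here is just a toy stand-in):

```python
# Count n-grams in a tokenized corpus; the corpus string here is a toy stand-in.
from collections import Counter

def ngram_counts(tokens, n):
    """Return a Counter over all n-grams (as tuples) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

corpus = "i love to eat pie and i love to eat chocolate".split()

print(ngram_counts(corpus, 1)[("love",)])                           # unigram count -> 2
print(ngram_counts(corpus, 2)[("love", "to")])                      # bigram count  -> 2
print(ngram_counts(corpus, 5)[("i", "love", "to", "eat", "pie")])   # 5-gram count  -> 1
```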
Google, is this good English?
• Seriously, you want to query google for
every phrase in the translation?
• Google created and released a 5-gram
data set.
 Now you can query Google locally
 (kind of)
Language Modeling
• What’s P(e)?
 P(English sentence)
 P(e_1, e_2, e_3, …, e_i)
 Using the chain rule
P(e_1)P(e_2 | e_1)P(e_3 | e_1, e_2)P(e_4 | e_1, e_2, e_3) … P(e_i | e_1, e_2, …, e_{i-1})
Language Modeling
P(e_1)P(e_2 | e_1)P(e_3 | e_1, e_2)P(e_4 | e_1, e_2, e_3) … P(e_i | e_1, e_2, …, e_{i-1})
• Markov assumption
 The choice of word e_i depends only on the n
words before e_i
P(e_i | e_1, e_2, …, e_{i-1}) ≈ P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
• Definition of conditional probability
P(e_i | e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}) = P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1}, e_i) / P(e_{i-4}, e_{i-3}, e_{i-2}, e_{i-1})
Language Modeling
P(pie | I, love, to, eat) = P(I, love, to, eat, pie) / P(I, love, to, eat)
Language Modeling
P(ei4 ,ei3 ,ei2,ei1,ei )
P(ei4 ,ei3,ei2 ,ei1)
• Approximate probability using counts

P(ei4 ,ei3 ,ei2,ei1,ei ) C(ei4 ,ei3,ei2 ,ei1,ei )

P(ei4 ,ei3,ei2 ,ei1)
C(ei4 ,ei3 ,ei2,ei1 )

• Use the n-gram corpus!
Language Modeling
• Use the n-gram corpus!
P(pie | I, love, to, eat) = P(I, love, to, eat, pie) / P(I, love, to, eat)
                          ≈ C(I, love, to, eat, pie) / C(I, love, to, eat)
                          = 2,760 / 409,000
                          ≈ 0.0067
 Not surprisingly, given that you love to eat, loving to
eat chocolate is more probable (0.177)
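Filling in the placeholder from the earlier sketch: a count-based estimate of the conditional probability, shown reproducing the slide's arithmetic (the two raw counts are the ones quoted above).

```python
# Maximum-likelihood estimate of P(word | history) from n-gram counts:
# P(w | h) is approximated by C(h + w) / C(h).

def cond_prob(word, history, counts):
    """counts maps n-gram tuples (of any length) to their corpus frequency."""
    hist_count = counts.get(tuple(history), 0)
    if hist_count == 0:
        return 0.0                      # undefined / zero without smoothing
    return counts.get(tuple(history) + (word,), 0) / hist_count

# The two counts quoted on the slide:
counts = {
    ("I", "love", "to", "eat"): 409_000,
    ("I", "love", "to", "eat", "pie"): 2_760,
}
print(cond_prob("pie", ("I", "love", "to", "eat"), counts))  # ~0.0067
```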
Language Modeling
• But what if
C(ei4 ,ei3,ei2,ei1,ei )  0
• Then P(e) = 0
• Happens even if the sentence is
 grammatically correct
 “Al Gore’s pink Hummer was stolen.”
Language Modeling
• Smoothing
 Many techniques
• Add one smoothing
 Add one to every count
 No more zeros, no problems
• Backoff
 If P(e1, e2, e3, e4, e5) = 0 use something
related to P(e1, e2, e3, e4)
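A minimal sketch of add-one (Laplace) smoothing on top of the count-based estimate above; `vocab_size` is an assumed vocabulary size, not a number from the slides.

```python
# Add-one (Laplace) smoothing: add 1 to every n-gram count so no
# conditional probability is ever exactly zero.

def smoothed_cond_prob(word, history, counts, vocab_size):
    h = tuple(history)
    numerator = counts.get(h + (word,), 0) + 1
    denominator = counts.get(h, 0) + vocab_size   # one extra count per possible next word
    return numerator / denominator

counts = {("I", "love", "to", "eat"): 409_000,
          ("I", "love", "to", "eat", "pie"): 2_760}

# An unseen continuation no longer gets probability zero:
print(smoothed_cond_prob("broccoli", ("I", "love", "to", "eat"), counts, vocab_size=100_000))
print(smoothed_cond_prob("pie", ("I", "love", "to", "eat"), counts, vocab_size=100_000))
```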
Language Modeling
• Wait… Is this how people “generate”
English sentences?
 Do you choose your fifth word based only on the four words before it?
• Admittedly, this is an approximation to a
process which is both
 intangible and
 hard for humans themselves to explain
• If you disagree, and care to defend
yourself, consider a PhD in NLP
Back to Translation
• Anyway, where were we?
 Oh right…
P(E | F) = P(F | E)P(E) / P(F) ∝ P(F | E)P(E)
 So, we’ve got P(e), let’s talk P(f|e)
Where will we get P(F|E)?
Cereal boxes in English + the same cereal boxes in French → Machine Learning Magic → P(F|E) model
Where will we get P(F|E)?
Books in English + the same books in French → Machine Learning Magic → P(F|E) model
We call collections stored in two languages parallel
corpora or parallel texts
Want to update your system? Just add more text!
Translated Corpora
• The Canadian Parliamentary Debates
 Available in both French and English
• UN documents
 Available in Arabic, Chinese, English, French,
Russian and Spanish
Problem:
• How are we going to generalize from examples
of translations?
• I’ll spend the rest of this lecture telling you:
 What makes a useful P(F|E)
 How to obtain the statistics needed for P(F|E) from
parallel texts
Strategy: Generative Story
• When modeling P(X|Y):
 Assume you start with Y
 Decompose the creation of X from Y into
some number of operations
 Track statistics of individual operations
 For a new example X,Y: P(X|Y) can be
calculated based on the probability of the
operations needed to get X from Y
What if…?
The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
New Information
• Call this new info a word alignment (A)
• With A, we can make a good story
The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F,A|E) Story
null The quick fox jumps over the lazy dog
P(F, A | E) = ?
P(F,A|E) Story
null The quick fox jumps over the lazy dog
f_1 f_2 f_3 … f_10
P(F, A | E) = ε
Simplifying assumption: choose the length of the French
sentence F. All lengths have equal probability ε
P(F,A|E) Story
null The quick fox jumps over the lazy dog
f_1 f_2 f_3 … f_10
P(F, A | E) = ε / (8 + 1)^10
There are (l + 1)^m = (8 + 1)^10 possible alignments
P(F,A|E) Story
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F, A | E) = ε / (8 + 1)^10 × p_t(Le | The) × p_t(renard | fox) × … × p_t(parasseux | lazy)
P(F,A|E) Story
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux
P(F, A | E) = ε / (l + 1)^m × ∏_{j=1..m} p_t(f_j | e_{a_j})
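A minimal sketch of this generative story as code; `t` is a table of word-translation probabilities p_t(f|e), and the toy values below are made up for illustration.

```python
# P(F, A | E) for one sentence pair under the simple story:
# epsilon / (l+1)^m  *  product over French positions j of p_t(f_j | e_{a_j}).

def p_f_a_given_e(french, english, alignment, t, epsilon=1.0):
    """alignment[j] gives the index (0 = null) of the English word that produced french[j]."""
    e_with_null = ["null"] + english
    l, m = len(english), len(french)
    p = epsilon / (l + 1) ** m
    for j, f_word in enumerate(french):
        p *= t.get((f_word, e_with_null[alignment[j]]), 0.0)
    return p

t = {("voiture", "car"): 0.9, ("rapide", "fast"): 0.9}   # made-up p_t values
print(p_f_a_given_e(["voiture", "rapide"], ["fast", "car"],
                    alignment=[2, 1], t=t))              # voiture<-car, rapide<-fast
```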
Getting Pt(f|e)
• We need numbers for Pt(f|e)
• Example: Pt(le|the)
 Count lines in a large collection of
aligned text
p_t(le | the) = #(le, the) / ( #(le, the) + #(la, the) + #(les, the) )
(Imagine many aligned sentence pairs, each drawn with its word-alignment lines, e.g.:
null The quick fox jumps over the lazy dog
Le renard rapide saut par - dessus le chien parasseux)
p_t(f | e) = #(e linked to f) / #(e linked to anything)
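A minimal sketch of turning a set of already-aligned sentence pairs into p_t(f|e) by counting links; the tiny aligned corpus here is invented for illustration.

```python
# Estimate p_t(f|e) from word-aligned sentence pairs by counting links:
# p_t(f|e) = #(e linked to f) / #(e linked to anything).
from collections import defaultdict

# Each example: a list of (french_word, english_word) links (invented data).
aligned_corpus = [
    [("le", "the"), ("renard", "fox")],
    [("le", "the"), ("chien", "dog")],
    [("la", "the"), ("voiture", "car")],
]

link_count = defaultdict(float)
total_for_e = defaultdict(float)
for links in aligned_corpus:
    for f, e in links:
        link_count[(f, e)] += 1
        total_for_e[e] += 1

p_t = {(f, e): c / total_for_e[e] for (f, e), c in link_count.items()}
print(p_t[("le", "the")])   # -> 2/3
print(p_t[("la", "the")])   # -> 1/3
```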
Where do we get the lines?
• That sure looked like a lot of monkeys…
• Remember: sometimes the information hidden
in the text just jumps out at you
 We’ll get alignments out of unaligned text by treating
the alignment as a hidden variable
 We infer an A that maximizes the probability of our corpus
 Generalization of ideas in HMM training: called EM
Where’s “heaven” in Vietnamese?
English: In the beginning God created the heavens and the earth.
Vietnamese: Ban dâu Dúc Chúa Tròi dung nên tròi dât.
English: God called the expanse heaven.
Vietnamese: Dúc Chúa Tròi dat tên khoang không la tròi.
English: … you are this day like the stars of heaven in number.
Vietnamese: … các nguoi dông nhu sao trên tròi.
Notice that “tròi” shows up in every Vietnamese sentence whose English contains “heaven”.
Example borrowed from Jason Eisner
EM: Expectation Maximization
• Assume a probability distribution (weights)
over hidden events
 Take counts of events based on this
distribution
 Use counts to estimate new parameters
 Use parameters to re-weight examples.
• Rinse and repeat
Alignment Hypotheses
[Figure: eight candidate alignments of “null I like milk” with “Je aime le lait”, weighted 0.65, 0.25, 0.05, 0.01, 0.01, 0.01, 0.01, 0.001]
Weighted Alignments
• What we’ll do is:
 Consider every possible alignment
 Give each alignment a weight - indicating how
good it is
P(A | E, F) = P(F, A | E) / P(F | E)
 Count weighted alignments as normal
Good grief! We forgot about P(F|E)!
• No worries, a little more stats gets us what
we need:
P(F | E) = Σ_A P(F, A | E)
P(A | E, F) = P(F, A | E) / Σ_A P(F, A | E)
Big Example: Corpus
1: fast car ↔ voiture rapide
2: fast ↔ rapide
Possible Alignments
1a: fast car / voiture rapide (voiture↔fast, rapide↔car)
1b: fast car / voiture rapide (voiture↔car, rapide↔fast)
2: fast / rapide (rapide↔fast)
Parameters
P(voiture|fast) = 1/2    P(rapide|fast) = 1/2
P(voiture|car) = 1/2     P(rapide|car) = 1/2
Weight Calculations
Alignment   P(A,F|E)          P(A|F,E)
1a          1/2 × 1/2 = 1/4   1/4 / 2/4 = 1/2
1b          1/2 × 1/2 = 1/4   1/4 / 2/4 = 1/2
2           1/2               1/2 / 1/2 = 1
Count Lines
Alignment weights: 1a = 1/2, 1b = 1/2, 2 = 1
#(voiture,fast) = 1/2
#(rapide,fast) = 1/2 + 1 = 3/2
#(voiture,car) = 1/2
#(rapide,car) = 1/2
Normalize
P(voiture|fast) = 1/4
P(rapide|fast) = 3/4
P(voiture|car) = 1/2
P(rapide|car) = 1/2
Parameters
P(voiture|fast) = 1/4    P(rapide|fast) = 3/4
P(voiture|car) = 1/2     P(rapide|car) = 1/2
Weight Calculations
Alignment   P(A,F|E)          P(A|F,E)
1a          1/4 × 1/2 = 1/8   1/8 / 4/8 = 1/4
1b          1/2 × 3/4 = 3/8   3/8 / 4/8 = 3/4
2           3/4               3/4 / 3/4 = 1
Count Lines
Alignment weights: 1a = 1/4, 1b = 3/4, 2 = 1
#(voiture,fast) = 1/4
#(rapide,fast) = 3/4 + 1 = 7/4
#(voiture,car) = 3/4
#(rapide,car) = 1/4
Normalize
P(voiture|fast) = 1/8
P(rapide|fast) = 7/8
P(voiture|car) = 3/4
P(rapide|car) = 1/4
After many iterations:
Alignment weights: 1a ≈ 0, 1b ≈ 1, 2 = 1
P(voiture|fast) = 0.001
P(rapide|fast) = 0.999
P(voiture|car) = 0.999
P(rapide|car) = 0.001
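The whole iteration above fits in a few lines of code. Below is a minimal sketch that reproduces the numbers in this worked example; to stay faithful to the slides it only enumerates the candidate alignments drawn here (full IBM Model 1 would sum over all (l+1)^m alignments per sentence pair).

```python
# Toy EM for the fast car / voiture rapide corpus, mirroring the worked example.
from collections import defaultdict

# Each training example lists its candidate alignments; an alignment is a
# tuple of (french_word, english_word) links.
examples = [
    # pair 1: "voiture rapide" / "fast car" -> alignments 1a and 1b
    [(("voiture", "fast"), ("rapide", "car")),    # 1a
     (("voiture", "car"), ("rapide", "fast"))],   # 1b
    # pair 2: "rapide" / "fast" -> only one possible alignment
    [(("rapide", "fast"),)],
]

t = defaultdict(lambda: 0.5)        # p_t(f|e), initialized uniformly (1/2 each)

for iteration in range(10):
    count = defaultdict(float)      # expected #(e linked to f)
    total = defaultdict(float)      # expected #(e linked to anything)
    for alignments in examples:
        # E-step: weight each alignment by P(F,A|E) (constants cancel when normalizing)
        w = [1.0] * len(alignments)
        for i, links in enumerate(alignments):
            for f, e in links:
                w[i] *= t[(f, e)]
        z = sum(w)                  # proportional to P(F|E)
        for i, links in enumerate(alignments):
            for f, e in links:
                count[(f, e)] += w[i] / z    # P(A|F,E) = P(F,A|E) / P(F|E)
                total[e] += w[i] / z
    for f, e in count:              # M-step: normalize counts into probabilities
        t[(f, e)] = count[(f, e)] / total[e]
    if iteration < 2:   # matches the Normalize slides: 3/4 then 7/8 for p_t(rapide|fast)
        print(iteration + 1, t[("rapide", "fast")], t[("voiture", "car")])

print(round(t[("rapide", "fast")], 3), round(t[("voiture", "car")], 3))  # both near 1
print(round(t[("voiture", "fast")], 3), round(t[("rapide", "car")], 3))  # both near 0
```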
Seems too easy?
• What if you have no 1-word sentence?
 Words in shorter sentences will get more weight (fewer possible alignments)
 Weight is additive throughout the corpus: if a word e
shows up frequently with some other word f, P(f|e) will
go up
 Like our heaven example
The Final Product
• Now we have a model for P(F|E)
• Test it by aligning a corpus!
 i.e., find argmax_A P(A|F,E)
• Use it for translation:
 Combine with our n-gram model for P(E)
 Search space of English sentences for one
that maximizes P(E)P(F|E) for a given F
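For this simple model, the best alignment argmax_A P(A|F,E) factorizes: each French word independently picks the English word (or null) with the highest p_t. A minimal sketch, assuming a filled-in table `t` of p_t(f|e) values:

```python
# Best (Viterbi) alignment under the simple model: since P(F,A|E) is a product of
# independent per-word terms, each French word just picks its best English word.

def best_alignment(french, english, t):
    e_with_null = ["null"] + english
    return [max(range(len(e_with_null)),
                key=lambda i: t.get((f, e_with_null[i]), 0.0))
            for f in french]

t = {("voiture", "car"): 0.999, ("rapide", "fast"): 0.999,
     ("voiture", "fast"): 0.001, ("rapide", "car"): 0.001}
print(best_alignment(["voiture", "rapide"], ["fast", "car"], t))  # -> [2, 1]
```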
Model could be a lot better:
• Word positions
• Multiple f’s generated by the same e
• Could take into account who generated
your neighbors
• Could use syntax, parsing
• Could align phrases
Sure, but is it any better?
• We’ve got some good ideas for improving
translation
• How can we quantify the change in
translation quality?
Sure, but is it any better?
• How to (automatically) measure translation?
 Original French
Dès qu'il fut dehors, Pierre se dirigea vers la rue de Paris, la principale rue du
Havre, éclairée, animée, bruyante.
 Human translation to English
As soon as he got out, Pierre made his way to the Rue de Paris, the high-street
of Havre, brightly lighted up, lively and noisy.
 Two machine translations back to French:
 Dès qu'il est sorti, Pierre a fait sa manière à la rue De Paris, la haut-rue de
Le Havre, brillamment allumée, animée et bruyante.
 Dès qu'il en est sorti, Pierre s'est rendu à la Rue de Paris, de la grande rue
du Havre, brillamment éclairés, animés et bruyants.
Example from http://www.readwriteweb.com/archives/google_translation_systran.php
Bleu Score
• Bleu
 Bilingual Evaluation Understudy
 A metric for comparing translations
• Considers
 n-grams in common with the target translation
 Length of target translation
• A score of 1 means identical to the reference; 0 means
no words in common
• Even human translations don’t score 1
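A heavily simplified, single-reference sketch of the idea (n-gram precision combined with a brevity penalty). Real BLEU uses modified (clipped) precisions for n = 1..4 aggregated over a whole test set, so treat this only as an illustration.

```python
# Simplified BLEU-like score for one candidate against one reference translation:
# geometric mean of n-gram precisions times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped counts
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec_sum += math.log(overlap / total)
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))   # penalize short candidates
    return brevity * math.exp(log_prec_sum / max_n)

print(simple_bleu("Jon appeared on TV", "Jon appeared on TV"))              # 1.0
print(simple_bleu("TV appeared on Jon", "Jon appeared on TV", max_n=2))     # ~0.58
```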
Google Translate
• http://translate.google.com/translate_t
 25 language pairs
• In the news (digg.com)
 http://www.readwriteweb.com/archives/google
_translation_systran.php
• In competition
 http://www.nist.gov/speech/tests/mt/doc/mt06
eval_official_results.html
Questions?
References
(Inspiration, sources of borrowed material)
• Colin Cherry, MT for NLP, 2005. http://www.cs.ualberta.ca/~colinc/ta/MT650.pdf
• Knight, K., Automating Knowledge Acquisition for Machine Translation, AI Magazine 18(4), 1997.
• Knight, K., A Statistical Machine Translation Tutorial Workbook, 1999. http://www.clsp.jhu.edu/ws99/projects/mt/mt-workbook.htm
• Eisner, J., JHU NLP Course notes: Machine Translation, 2001. http://www.cs.jhu.edu/~jason/465/PDFSlides/lect32-translation.pdf
• Olga Kubassova, Probability for NLP. http://www.comp.leeds.ac.uk/olga/ProbabilityTutorial.ppt
Enumerating all alignments
P(F | E) = ε / (l + 1)^m × Σ_{a_1=0..l} … Σ_{a_m=0..l} ∏_{j=1..m} p_t(f_j | e_{a_j})
There are (l + 1)^m possible alignments!
Gah!
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
p_t(f_1 | e_0) p_t(f_2 | e_0) +
p_t(f_1 | e_0) p_t(f_2 | e_1) +
p_t(f_1 | e_0) p_t(f_2 | e_2) +
p_t(f_1 | e_1) p_t(f_2 | e_0) +
p_t(f_1 | e_1) p_t(f_2 | e_1) +
p_t(f_1 | e_1) p_t(f_2 | e_2) +
p_t(f_1 | e_2) p_t(f_2 | e_0) +
p_t(f_1 | e_2) p_t(f_2 | e_1) +
p_t(f_1 | e_2) p_t(f_2 | e_2)
Let’s move these over here…
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
p_t(f_1 | e_0) [ p_t(f_2 | e_0) + p_t(f_2 | e_1) + p_t(f_2 | e_2) ] +
p_t(f_1 | e_1) [ p_t(f_2 | e_0) + p_t(f_2 | e_1) + p_t(f_2 | e_2) ] +
p_t(f_1 | e_2) [ p_t(f_2 | e_0) + p_t(f_2 | e_1) + p_t(f_2 | e_2) ]
And now we can do this…
Null (0) Fast (1) car (2)
Voiture (1) rapide (2)
[ p_t(f_1 | e_0) + p_t(f_1 | e_1) + p_t(f_1 | e_2) ] × [ p_t(f_2 | e_0) + p_t(f_2 | e_1) + p_t(f_2 | e_2) ]
So, it turns out:
Σ_{a_1=0..l} … Σ_{a_m=0..l} ∏_{j=1..m} p_t(f_j | e_{a_j}) = ∏_{j=1..m} Σ_{i=0..l} p_t(f_j | e_i)
Requires only m(l + 1) operations.
Can be used whenever your alignment choice
for one word does not affect the probability of
the rest of the alignment
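A minimal sketch checking this identity numerically on the two-word example above, with made-up p_t values; the brute-force sum over all (l+1)^m alignments matches the product of per-word sums.

```python
# Check that summing over all (l+1)^m alignments equals the cheap factorized form:
# sum_{a_1..a_m} prod_j p_t(f_j | e_{a_j})  ==  prod_j sum_i p_t(f_j | e_i).
import math
from itertools import product as cartesian

english = ["null", "fast", "car"]     # e_0, e_1, e_2
french = ["voiture", "rapide"]        # f_1, f_2
t = {("voiture", "null"): 0.1, ("voiture", "fast"): 0.2, ("voiture", "car"): 0.7,
     ("rapide", "null"): 0.1, ("rapide", "fast"): 0.8, ("rapide", "car"): 0.1}

# Brute force over all (l+1)^m = 3^2 alignments.
brute = sum(
    math.prod(t[(f, english[a])] for f, a in zip(french, alignment))
    for alignment in cartesian(range(len(english)), repeat=len(french))
)

# Factorized form: only m * (l+1) lookups.
factorized = math.prod(sum(t[(f, e)] for e in english) for f in french)

print(brute, factorized)   # both 1.0 with these values (each row of t sums to 1)
```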