Statistical Machine Translation
SMT – Basic Ideas
Stephan Vogel
MT Class
Spring Semester 2011
Stephan Vogel - Machine Translation
1
Overview
 Deciphering foreign text – an example
 Principles of SMT
 Data processing
Deciphering Example
 Apinaye – English
 Apinaye belongs to the Ge family of Brazil
 Spoken by 800 (according to SIL, 1994)
 http://www.ethnologue.com/show_family.asp?subid=90784
http://www.language-museum.com/a/apinaye.php
 Example from Linguistic Olympics 2008, see
http://www.naclo.cs.cmu.edu
 Parallel Corpus
(some characters adapted)
Kukre kokoi – The monkey eats
Ape kra – The child works
Ape kokoi rats – The big monkey works
Ape mi mets – The good man works
Ape mets kra – The child works well
Ape punui mi pinjets – The old man works badly
 Can we translate a new sentence?
Deciphering Example
 Parallel Corpus (some characters adapted)
Kukre kokoi
The monkey eats
Ape kra
The child works
Ape kokoi rats
The big monkey works
Ape mi mets
The good man works
Ape mets kra
The child works well
Ape punui mi pinjets
The old man works badly
 Can we build a lexicon from these sentence pairs?
 Observations:
  Apinaye: Kukre (1), Ape (5); English: The (6), works (5)
 Aha! -> first guess: Ape – works
  monkey in 1, 3; child in 2, 5; man in 4, 6
 Different distributions over the corpus: do we find words with similar distributions on the Apinaye side?
… Vocabularies

Corpus:
Kukre kokoi – The monkey eats
Ape kra – The child works
Ape kokoi rats – The big monkey works
Ape mi mets – The good man works
Ape mets kra – The child works well
Ape punui mi pinjets – The old man works badly

Vocabularies:
Apinaye (9 words): kukre, kokoi, ape, kra, rats, mi, mets, punui, pinjets
English (11 words): The, monkey, eats, child, works, big, good, man, well, old, badly

 Observations:
  9 Apinaye words, 11 English words
 Expectations:
  English words without translation?
  Apinaye words corresponding to more than 1 English word?
… Word Frequencies

Corpus:
Kukre kokoi – The monkey eats
Ape kra – The child works
Ape kokoi rats – The big monkey works
Ape mi mets – The good man works
Ape mets kra – The child works well
Ape punui mi pinjets – The old man works badly

Vocabularies, with frequencies:
Apinaye: kukre 1, kokoi 2, ape 5, kra 2, rats 1, mi 2, mets 2, punui 1, pinjets 1
English: The 6, monkey 2, eats 1, child 2, works 5, big 1, good 1, man 2, well 1, old 1, badly 1

 Suggestions:
  ‘ape’ (5) could align to ‘The’ (6) or ‘works’ (5)
  More likely that the content word ‘works’ has a match, i.e. ‘ape’ = ‘works’
  Other word pairs are difficult to predict – too many similar frequencies
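The frequency counts above can be reproduced with a few lines of Python; the corpus literal is the six sentence pairs from the slides:

```python
from collections import Counter

# Toy Apinaye-English parallel corpus from the slides
corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Word frequencies on each side (case-folded, whitespace tokenization)
ap_freq = Counter(w.lower() for ap, _ in corpus for w in ap.split())
en_freq = Counter(w.lower() for _, en in corpus for w in en.split())

print(ap_freq["ape"], en_freq["works"], en_freq["the"])  # 5 5 6
```

The matching counts for ‘ape’ (5) and ‘works’ (5) are exactly what motivates the first alignment guess.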
… Location in Corpus

Corpus:
Kukre kokoi – The monkey eats
Ape kra – The child works
Ape kokoi rats – The big monkey works
Ape mi mets – The good man works
Ape mets kra – The child works well
Ape punui mi pinjets – The old man works badly

Vocabularies, with occurrences (sentence IDs):
Apinaye: kukre {1}, kokoi {1,3}, ape {2,3,4,5,6}, kra {2,5}, rats {3}, mi {4,6}, mets {4,5}, punui {6}, pinjets {6}
English: The {1,2,3,4,5,6}, monkey {1,3}, eats {1}, child {2,5}, works {2,3,4,5,6}, big {3}, good {4}, man {4,6}, well {5}, old {6}, badly {6}
 Observations:
 Same sentences: ‘kukre’ – ‘eats’,
‘kokoi’ – ‘monkey’, ‘ape’ – ‘works’,
‘kra’ – ‘child’, ‘rats’ – ‘big’, ‘mi’ – ‘man’
 ‘mets’ (4 and 5) =? ‘good’ (4) and ‘well’ (5); makes sense
 ‘punui’ and ‘pinjets’ match ‘old’ and ‘badly’ – which is which?
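The occurrence sets above make the candidate pairs mechanical to find: words whose sentence-ID sets are identical are strong translation candidates. A minimal sketch (the ‘punui’/‘pinjets’ ambiguity shows up as multiple matches):

```python
from collections import defaultdict

corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Record the set of sentence IDs each word occurs in, per language
ap_occ, en_occ = defaultdict(set), defaultdict(set)
for i, (ap, en) in enumerate(corpus, start=1):
    for w in ap.lower().split():
        ap_occ[w].add(i)
    for w in en.lower().split():
        en_occ[w].add(i)

# Words with identical occurrence sets are candidate translation pairs
pairs = [(a, e) for a in ap_occ for e in en_occ if ap_occ[a] == en_occ[e]]
```

Both (‘punui’, ‘old’) and (‘punui’, ‘badly’) survive this test, so occurrence sets alone cannot resolve the last sentence pair – just as the slide observes.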
… Location in Sentence

Corpus (Apinaye – English – alignment EN-AP):
Kukre kokoi – The monkey eats – 1-0 2-2 3-1
Ape kra – The child works – 1-0 2-2 3-1
Ape kokoi rats – The big monkey works – 1-0 2-3 3-2 4-1
Ape mi mets – The good man works – 1-0 2-3 3-2 4-1
Ape mets kra – The child works well – 1-0 2-3 3-1 4-2
Ape punui mi pinjets – The old man works badly – 1-0 2-??? 3-3 4-1 5-???
 Observations:
  First English word (‘The’) does not align; we say it aligns to the NULL word
  Apinaye verb in first position
  English last word aligns to 1st or 2nd position
  English -> Apinaye: reverse word order (not strictly in sentence pair 5)
 Hypothesis:
  Alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2,
i.e. ‘pinjets’ – ‘old’ and ‘punui’ – ‘badly’
… POS Information

Corpus (with POS tags):
Kukre kokoi (V N) – The monkey eats (Det N V)
Ape kra (V N) – The child works (Det N V)
Ape kokoi rats (V N Adj) – The big monkey works (Det Adj N V)
Ape mi mets (V N Adj) – The good man works (Det Adj N V)
Ape mets kra (V Adv N) – The child works well (Det N V Adv)
Ape punui mi pinjets (V ??? N ???) – The old man works badly (Det Adj N V Adv)
 Observations:
 English determiner (‘The’) does not align; perhaps no determiners in Apinaye
 English Verb Adverb -> Apinaye: Verb Adverb -> no reordering
 English Adjective Noun -> Apinaye: Noun Adjective -> reordering
 Hypothesis:
 ‘pinjets’ is Adj to make it N Adj, ‘punui’ is Adv
(consistent with alignment hypothesis)
Translate New Sentences: Ap - En

 Source Sentence: Ape rats mi mets
 Lexical information: works big man good/well
 Reordering information: The good man works big
 Better lexical choice: The good man works hard
 Compare: Ape mi mets -> The good man works

 Source Sentence: Kukre rats kokoi punui
 Lexical information: eats big monkey badly
 Reordering information: The bad monkey eats big
 Better lexical choice: The bad monkey eats a lot
Translate New Sentences: En - Ap

 Source Sentence: The old monkey eats a lot
 Lexical information: NULL pinjets kokoi kukre rats
 Reordering information: kukre rats kokoi pinjets

 Or:
 Deleting words: old monkey eats a lot
 Rephrase: old monkey eats big
 Reorder: eats big monkey old
 Lexical information: kukre rats kokoi pinjets

 Source Sentence: The big child works a long time
 Delete plus rephrase: big child works big
 Reorder: works big child big
 Lexical information: Ape rats kra rats
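The Ap-En examples above can be mimicked with a toy decoder: gloss each word with the deciphered lexicon, reverse the order, and prepend ‘The’. This is only a sketch of the rough reordering rule from the alignment slide; it fixes one sense per word (e.g. ‘mets’ = ‘good’, ignoring ‘well’) and will misplace adverbs in longer sentences:

```python
# Toy lexicon from the deciphering exercise; one gloss per word
# (the punui='badly', pinjets='old' choice follows the slides' hypothesis)
lexicon = {"ape": "works", "kukre": "eats", "kokoi": "monkey",
           "kra": "child", "mi": "man", "rats": "big",
           "mets": "good", "punui": "badly", "pinjets": "old"}

def translate_ap_en(sentence):
    """Gloss each word, reverse the order, and prepend 'The'
    (the rough Apinaye -> English reordering rule)."""
    glosses = [lexicon[w] for w in sentence.lower().split()]
    return "The " + " ".join(reversed(glosses))

print(translate_ap_en("Ape mi mets"))   # The good man works
print(translate_ap_en("Kukre kokoi"))   # The monkey eats
```

The reversal rule reproduces the three-word sentences exactly; the four-word examples above need the extra adverb-placement and lexical-choice steps the slide walks through by hand.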
Overview
 Deciphering foreign text – an example
 Principles of SMT
 Data processing
Principles of SMT
 We will use the same approach – learning from data
 Build translation models using frequency, co-occurrence, word
position, etc. information
 Use the models to translate new sentences
 Not manually, but fully automatically
 The training is done automatically
 There is still lots of manual work left: designing models, preparing data,
running experiments, etc.
Statistical versus Grammar-Based
 Often statistical and grammar-based MT are seen as alternatives,
even opposing approaches – wrong !!!
 Dichotomies are:
 Use probabilities || everything is equally likely, yes/no decision
 Rich (deep) structure || no or only flat structure
 Both dimensions are continuous
 Examples
 EBMT: no/little structure and heuristics
 SMT: (initially only) flat structure and probabilities
 XFER: deep(er) structure and heuristics
[Diagram: 2×2 grid – probabilities (no probs / probs) vs. structure (flat / deep): flat + no probs = EBMT; flat + probs = SMT; deep + no probs = XFER, Interlingua; deep + probs = the “Holy Grail”]
 Goal: structurally rich probabilistic models
 statXFER: deep structure and probabilities
 Syntax-augmented SMT:
deep structure and probabilities
Statistical Machine Translation
 Translators translate source text
 Use machine learning techniques to extract useful knowledge
  Translation model: word and phrase translations
  Language model: how likely words follow in a particular sequence
 Translation system (decoder) uses these models to translate new sentences

[Diagram: source sentence → decoder with translation model and target language model → translation]

 Advantages:
  Can quickly train for new languages
  Can adapt to new domains
 Problems:
  Need parallel data
  All words, even punctuation, are treated equally
  Difficult to pinpoint the causes of errors
Tasks in SMT
 Modelling
build statistical models which capture characteristic features of translation
equivalences and of the target language
 Training
train translation model on bilingual corpus, train language model on
monolingual corpus
 Decoding
find best translation for new sentences according to models
 Evaluation
 Subjective evaluation: fluency, adequacy
 Automatic evaluation: WER, Bleu, etc
 And all the nitty-gritty stuff
 Text preprocessing, data cleaning
 Parameter tuning (minimum error rate training)
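Of the automatic metrics mentioned, WER is simple enough to sketch: the word-level Levenshtein distance between hypothesis and reference, divided by the reference length:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)

print(wer("the good man works", "the good man sleeps"))  # 0.25
```

One substitution out of four reference words gives WER 0.25; BLEU, being an n-gram precision metric over a whole test set, takes considerably more machinery.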
Noisy Channel View
“French is actually English, which has been garbled during
transmission; recover the correct, original English”
Speaker speaks English → noisy channel distorts it into French → you hear French, but need to recover the English
Bayesian Approach
Select the translation which has the highest probability:

ê = argmax{ p(e | f) } = argmax{ p(e) p(f | e) }

where p(e) is the source model, p(f | e) the channel model, and the argmax the search process.
SMT Architecture
p(e) – language model
p(f | e) – translation model
Log-Linear Model
 In practice: ê = argmax{ log p(e) + log p(f | e) }
 Translation model (TM) and language model (LM) may be of different quality:
  simplifying assumptions
  trained on different amounts of data
 Give different weights to both models:
ê = argmax{ w1 * log p(e) + w2 * log p(f | e) }
 Why not add more features?
ê = argmax{ w1 * h1(e,f) + ... + wn * hn(e,f) }
 Note: We don’t need the normalization constant for the argmax
Overview
 Deciphering foreign text – an example
 Principles of SMT
 Data processing
Corpus Statistics
We want to know how much data
 Corpus size: not file size, not documents, but words and
sentences
 Why is file size not important?
 Vocabulary: number of word types
We want to know some distributions
 How many words are seen only once?
 Why is this interesting?
 Does it help to increase the corpus?
 …
 How long are the sentences?
  Does it matter if we have many short or fewer, but longer sentences?
All Simple, Basic, Important
 Important: When you publish, these numbers are important
 To be able to interpret the results
E.g. what works on small corpora may not work on large corpora
 To make them comparable to other papers
 Basic: no deep thinking, nothing fancy
 Simple: a few unix commands, a few simple scripts
 wc, grep, sed, sort, uniq
 perl, awk (my favorite), perhaps python, …
 Let’s look at some data!
BTEC Spa-Eng
 Corpus Statistics
 Corpus and vocabulary size
 Percentage of singletons
 Number of unknown words, out-of-vocabulary (OOV) rate
 Sentence length balance
 Text normalization
 Spoken language forms: I’ll, we’re, but also I will, we are
Note: this was shown online
Tokenization
 Punctuation attached to words
 Example: ‘you’ ‘you,’ ‘you.’ ‘you?’
 All different strings, i.e. all are different words
 Tokenization can be tricky
  What about punctuation in numbers?
  What about abbreviations?
 Numbers are not just numbers
  Percentages: 1.2%
  Ordinals: 1st, 2.
  Ranges: 2000-2006, 3:1
  And more: (A5-0104/1999)
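A rough tokenizer along these lines splits off ordinary punctuation but keeps percentages, ranges, and simple ordinals like “1st” intact. The regular expression is a simplistic heuristic for illustration, not a production tokenizer (e.g. it does not handle German-style ordinals like “2.”):

```python
import re

def tokenize(text):
    """Keep numbers with internal punctuation (1.2%, 2000-2006, 3:1, 1st)
    as single tokens; split all other punctuation off as separate tokens."""
    pattern = r"\d+(?:[.,:\-/]\d+)*(?:%|st|nd|rd|th)?|\w+|[^\w\s]"
    return re.findall(pattern, text)

print(tokenize("you, you. you?"))
print(tokenize("growth of 1.2% in 2000-2006"))
```

So ‘you’, ‘you,’, ‘you.’ and ‘you?’ all map to the same token ‘you’, while ‘1.2%’ and ‘2000-2006’ stay whole.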
GigaWord Corpus
 Distributed by LDC
 Collection of newspapers: NYT, Xinhua News, …
 > 3 billion words
 How large is vocabulary?
 Some observations in vocabulary
 Number of entries with digits
 Number of entries with special characters
 Number of strange ‘words’
 Some observations in corpus
 Sentences with lots of numbers
 Sentences with lots of punctuation
 Sentences with very long words
Note: this was shown online
And then the more interesting Stuff
 POS tagging
 Parsing
 For syntax-based MT systems
 How parallel are the parse trees?
 Word segmentation
 Morphological processing
In all these tasks the central problem is:
How to make the corpus more parallel?