The State of the Art in Phrase-Based
Statistical Machine Translation
(SMT)
Roland Kuhn, George Foster, Nicola Ueffing
February 2007
Tutorial Plan
A. Overview
B. Details & research topics
NOTE:
best overall reference for SMT hasn’t been
published yet – Philipp Koehn’s « Statistical Machine
Translation » (to be published by Cambridge University
Press). Some of the material presented here is from a
draft of that book.
Tutorial Plan
A. Overview
The MT Task & Approaches to it
Examples of SMT output
SMT Research: Culture, Evaluations, & Metrics
SMT History: IBM Models
Phrase-based SMT
Phrase-Based Search
Loglinear Model Combination
Target Language Model P(T)
Flaws of Phrase-based, Loglinear Systems
PORTAGE: a Typical SMT System
The MT Task &
Approaches to it
• Core MT task: translate a sentence from a source
language S to target language T
• Conventional expert system approach: hire experts to
write rules for translating S to T
• Statistical approach: using a bilingual text corpus (lots
of S sentences & their translations into T), train a
statistical translation model that will map each new S
sentence into a T sentence
The MT Task &
Approaches to it
[Figure: Expert System = experts + manually coded rules ("If « … » then …, If « … » then …, … Else …"); Statistical System = bilingual parallel corpus (S, T) + machine learning → statistical rules, e.g. P(but | mais) = 0.7, P(however | mais) = 0.3, P(where | où) = 1.0, …
Example source S: Mais où sont les neiges d'antan?
Expert system output – a single T: But where are the snows of yesteryear?
Statistical system output – a ranked list: T1: But where are the snows of yesteryear? P = 0.41; T2: However, where are yesterday's snows? P = 0.33; T3: Hey - where did the old snow go? P = 0.18; …]
The MT Task &
Approaches to it
“Expert” vs. “Statistical” systems
• Expert systems incorporate deep linguistic knowledge
• They still yield top performance for well-studied language pairs in non-specialized domains
• Computationally cheap (compared to statistical MT)
BUT
• Brittle
• Expensive to maintain (messy software engineering)
• Expensive to port to new semantic domains or new language pairs
• Typically yield only one T sentence for each S sentence
The MT Task &
Approaches to it
“Expert” vs. “Statistical” systems
• More E-text, better algorithms, stronger machines → quality of SMT output approaching that of expert systems
• Statistical approach has beaten expert systems in related areas, e.g., automatic speech recognition
• SMT is robust (does well on frequent phenomena)
• Easy to maintain
• Easily ported to new semantic domains or new language pairs – IF training corpora available
• For each S sentence, yields many T sentences (each with a probabilistic score) – useful for semi-supervised translation
The MT Task &
Approaches to it
Structure of Typical SMT System
[Figure: offline training – a bilingual parallel corpus (S, T) is used to train the Phrase Translation Model, and extra target corpora (optional extra LM training corpora) to train the Target Language Model.
Run time – S: Mais où sont les neiges d'antan? → Preprocessor → "mais où sont les neiges d' antan ?" → Decoder (using the phrase translation model & target language model) → initial N-best hypotheses:
T1: however where are the snows #d' antan# P = 0.22; T2: but where are the snows #d' antan# P = 0.21; T3: but where did the #d' antan# snow go P = 0.13; …
→ Reordering (using other knowledge sources) → Postprocessor → final N-best hypotheses:
T1: But where are the snows of yesteryear? P = 0.41; T2: However, where are yesterday's snows? P = 0.33; …]
The MT Task &
Approaches to it
Commercial Systems
• Systran, the biggest MT company, uses expert systems; so do most MT companies. However, Systran has recently begun exploring the possibility of adding a statistical component to its system.
• Important exception: LanguageWeaver, a new company based on SMT (closely linked to researchers at ISI, U. Southern California)
• Google has a superb SMT research team – but online, it still mainly uses Systran (probably because of the computational cost of online SMT). It seems to be gradually swapping in SMT systems for language pairs with lower traffic.
Examples of SMT output
Chinese → English output:
REF: Hong Kong citizens jumped for joy when they knew Beijing's bid for 2008 Olympic games was successful.
PORTAGE Dec. 2004: The public see that Beijing's hosting of the Olympic Games in 2008 excited.
PORTAGE Nov. 2006: Hong Kong people see Beijing's successful bid for the 2008 Olympic Games, very happy.
REF: The U.S. delegation includes a China expert from Stanford University, two Senate foreign policy aides and a former
State Department official who has negotiated with North Korea.
PORTAGE Dec. 2004: The United States delegation comprising members from the Stanford University, one of the
Chinese experts, two of the Senate foreign policy as well as assistant who was responsible for dealing with Pyongyang
authorities of the former State Department officials.
PORTAGE Nov. 2006: The US delegation included members from Stanford University and an expert on China, two
Senate foreign policy, and one who is responsible for dealing with Pyongyang authorities, a former State Department
officials.
REF: Kuwait foreign minister Mohammad Al Sabah and visiting Jordan foreign minister Muasher jointly presided the first
meeting of the joint higher committee of the two countries on that day.
PORTAGE Dec. 2004: Kuwaiti Foreign Secretary Sabah on that day and visiting Jordan Foreign Secretary maasher cochaired the section about the two countries mixed Committee at the inaugural meeting.
PORTAGE Nov. 2006: Kuwaiti Foreign Minister Sabah day and visiting Jordanian Foreign Minister of Malaysia, cochaired by the two countries, the joint commission met for the first time.
REF: The Beagle 2 was scheduled to land on Mars on Christmas Day, but its signal is still difficult to pin down.
PORTAGE Dec. 2004: small dog meat, originally scheduled for Christmas landing Mars, but it is a signal remains elusive.
PORTAGE Nov. 2006: 2 small dog meat for Christmas landing on Mars, but it signals is still unpredictable.
Examples of SMT output
And a silly English → German example from Google (Jan. 25, 2007):
the hotel has a squash court

das Hotel hat ein Kürbisgericht
(think “zucchini tribunal”)
* but this kind of error – perfect syntax, never-seen word combination –
isn’t typical of a statistical system, so this was probably a rule-based
system
SMT Research: Culture,
Evaluations, & Metrics
Culture
• SMT research is very engineering-oriented; driven by
performance in NIST & other evaluations (see later slides)
→ if a heuristic yields a big improvement in BLEU scores & a wonderful new theoretical approach doesn't, expect the former to get much more attention than the latter
• Advantages of SMT culture: open-minded to new ideas that can
be tested quickly; researchers who count have working systems
with reasonably well-written software (so they can participate in
evaluations)
• Disadvantages of SMT culture: closed-minded to ideas not tested
in a working system
→ if you have a brilliant theory that doesn't show a BLEU score improvement in a reasonable baseline system, don't expect SMT researchers to read your paper!
SMT Research: Culture,
Evaluations, & Metrics
The NIST MT Evaluations
• Since 2001, US National Institute of Standards & Technology (NIST) has been
evaluating MT systems
• Participants include MIT, IBM, CMU, RWTH, Hong Kong UST, ATR, IRST, others … – and NRC:
NRC's system is called PORTAGE (in NIST evaluations 2005 & 2006).
• Main NIST language pairs: Chinese→English, Arabic→English
• Semantic domains: news stories & multigenre
• Training corpora released each fall, test corpus each spring; participants have
1 working week to submit target sentences
• NIST evaluates systems comparatively. In 2005 (http://www.nist.gov/speech/tests/mt/mt05eval_official_results_release_20050801_v3.html) & 2006 (http://www.nist.gov/speech/tests/mt/mt06eval_official_results.html), statistical systems beat expert systems according to the BLEU metric
SMT Research: Culture,
Evaluations, & Metrics
Other MT Evaluations
• WPT/WMT usually organized each spring by Philipp Koehn &
Christoph Monz – smaller training corpora than NIST, European
language pairs. In 2006, evaluated on French <-> English, German <-> English, Spanish <-> English. http://www.statmt.org/wmt06/proceedings/
• TC-STAR Evaluation for spoken language translation. In 2006,
evaluated on Chinese->English (one direction only) and
Spanish <->English http://www.elda.org/tcstar-workshop/2006eval.htm
• IWSLT Evaluation for spoken language translation. In 2006, evaluated
on Arabic->English, Chinese->English, Italian->English,
Japanese->English http://www.slt.atr.jp/IWSLT2006_whatsnew/index.html
SMT Research: Culture,
Evaluations, & Metrics
GALE Project
• Huge DARPA-sponsored project: $50 million per year for 5
years. Three consortia: BBN-led « Agile », IBM-led
« Rosetta », SRI-led « Nightingale ».
• NRC team is in MT working group of Nightingale.
[Figure: GALE pipeline – (Arabic or Chinese) speech → automatic speech recognition (ASR) → (Arabic or Chinese) transcriptions; these, together with (Arabic or Chinese) documents, → machine translation (MT) → English text → distillation (IR/database component).]
SMT Research: Culture,
Evaluations, & Metrics
What is BLEU?
• Human evaluation of automatic translation quality hard &
expensive. BLEU metric (invented at IBM) compares MT output
with human-generated reference translations via N-gram
matches.
• N-gram precision = #(N-grams in MT output seen in ref.) / #(N-grams in MT output)
• Example (from P. Koehn):
REF = Israeli officials are responsible for airport security
Sys A = Israeli officials responsibility of airport safety (1-gram matches: Israeli, officials, airport; one 2-gram match: « Israeli officials »)
Sys B = airport security Israeli officials are responsible (all 1-grams match; includes a 4-gram match: « Israeli officials are responsible »)
SMT Research: Culture,
Evaluations, & Metrics
What is BLEU?
• REF = Israeli officials are responsible for airport security
Sys A = Israeli officials responsibility of airport safety
Sys B = airport security Israeli officials are responsible
• Sys A: 1-gram precision = 3/6 (Israeli, officials, airport);
2-gram precision = 2/5 (Israeli officials);
3-gram precision = 0/4 = 4-gram precision = 0/3.
Sys B: 1-gram precision = 6/6; 2-gram precision = 4/5;
3-gram precision = 2/4; 4-gram precision = 1/3.
• BLEU-N multiplies together the N N-gram precisions – the higher
the value, the better the translation. But, could cheat by having
very few words in MT output – so, brevity penalty.
SMT Research: Culture,
Evaluations, & Metrics
What is BLEU?
BLEU-N = (brevity-penalty) * Π_{i=1..N} (precision_i)^λ_i , where
brevity-penalty = min(1, output-length/ref-length) .
Usually, we set N = 4 and all λ_i = 1, so we have
BLEU-4 = min(1, output-length/ref-length) * Π_{i=1..4} precision_i .
• If any MT output has no N-grams matching ref., for some N=1,
…, 4, BLEU-4 is zero. So, normally compute BLEU over whole
test set of at least a hundred or so sentences.
• Multiple references: if an N-gram has K occurrences in output,
look for single ref. that has K or more copies of that N-gram. If
find such a single ref., that N-gram has matched K times. If not,
look for a ref. that has the highest # of copies (L) of that N-gram;
use L in precision calculation. Ref-length = closest length.
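To make the computation concrete, here is a minimal Python sketch of corpus-level BLEU-4 as defined above: clipped N-gram precisions over a whole test set, multiplied together with the brevity penalty. For simplicity it assumes a single reference per sentence; the multiple-reference clipping rule just described would replace the per-sentence reference counts.

```python
# Minimal sketch of BLEU-4 as defined on these slides (single reference per sentence).
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(outputs, references):
    """outputs, references: lists of tokenized sentences (lists of words)."""
    match = [0] * 5            # match[n], total[n] for n = 1..4
    total = [0] * 5
    out_len = ref_len = 0
    for out, ref in zip(outputs, references):
        out_len += len(out)
        ref_len += len(ref)
        for n in range(1, 5):
            out_counts = Counter(ngrams(out, n))
            ref_counts = Counter(ngrams(ref, n))
            # clipped matches: an output N-gram matches at most as many
            # times as it occurs in the reference
            match[n] += sum(min(c, ref_counts[g]) for g, c in out_counts.items())
            total[n] += max(len(out) - n + 1, 0)
    brevity = min(1.0, out_len / ref_len)
    score = brevity
    for n in range(1, 5):
        score *= match[n] / total[n]    # zero if any N-gram precision is zero
    return score

# Example from the slides (system B):
ref = "Israeli officials are responsible for airport security".split()
sys_b = "airport security Israeli officials are responsible".split()
print(bleu4([sys_b], [ref]))   # 6/6 * 4/5 * 2/4 * 1/3 * brevity(6/7) ≈ 0.114
```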
SMT Research: Culture,
Evaluations, & Metrics
Does BLEU correlate with human judgment?
[Figure: human quality scores (0 = terrible, 3 = excellent) plotted by translator identity.]
* BLEU kind of correlates with human judgment; works best with multiple references.
SMT Research: Culture,
Evaluations, & Metrics
Why BLEU Is Controversial
• If a system produces a brilliant translation that uses many N-grams not found in the references, it will receive a low score.
• Proponents of the expert system approach argue that BLEU is
biased against this approach, & favours SMT
• Partial confirmation:
1. in NIST 2006 Arabic-to-English evaluation, AppTek hybrid
system (rule-based + SMT system) did best according to
human evaluators, but not according to BLEU.
2. in 2006 WMT evaluation Systran was scored comparably to other systems for some European language pairs (e.g., French-English) by human evaluators, but had much lower in-domain BLEU scores (see graphs in http://www.statmt.org/wmt06/proceedings/pdf/WMT14.pdf).
SMT Research: Culture,
Evaluations, & Metrics
Other Automatic Metrics
• SMT systems need an automatic metric for tuning (must try
out thousands of variants). Automatic metrics compare MT
output with human-generated reference translations.
• Rivals of BLEU:
* translation edit rate (TER) – how many edit ops to match
references? http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf
* METEOR – compares MT output with references in a way that's less dependent on word choice (via stemming, WordNet, etc.). Gaining credibility: correlates better than BLEU with human scores. However, METEOR is only defined for translation into English.
http://www.cs.cmu.edu/~alavie/METEOR/.
SMT Research: Culture,
Evaluations, & Metrics
Manual Metrics
• Human evaluation of SMT preferable to automatic evaluation, but
much slower & more expensive. Can’t use for system tuning.
• Ask humans to rank systems by adequacy and fluency. Adequacy:
does MT output convey same meaning as source?
Fluency: does MT output look like normal target-language text?
(Good syntax & idiom).
• Metrics based on human postediting of MT output. E.g., HTER.
• Metrics based on human understanding of MT output. Related to
adequacy, but less subjective. E.g., Lincoln Labs metric: give English
output of Arabic MT system to unilingual English analyst, then test
him with standard « Defense Language Proficiency Test » (see
Jones05).
SMT Research: Culture,
Evaluations, & Metrics
Who Uses Which Metric When?
• Many groups use BLEU for automatic system tuning
• NIST, WPT/WMT, TC-STAR, & other evaluations often have
BLEU as official metric, with some human reality checks.
Koehn & Monz WPT/WMT: participants do human
fluency/adequacy evaluations - nice analyses!
• Many « expert/rule-based MT » researchers hate BLEU (can
become excuse not to evaluate system competitively)
• In theory, manual metrics should be related to MT task: e.g.,
adequacy for browsing/gisting, Lincoln Labs metric for
intelligence community, HTER if MT output will be post-edited.
So why is HTER GALE’s official metric? HTER = Human
Translation Edit Rate: MT output hand-edited by humans;
measure # of operations performed.
SMT History: IBM Models
• In the late 1980s, members of IBM’s speech recognition group
applied statistical learning techniques to bilingual corpora. These
American researchers worked mainly with the Canadian Hansard
– bilingual transcription of parliamentary proceedings.
• These researchers quit IBM around 1991 for a hedge fund,
Renaissance Technologies – they are now very rich!
• Renewed interest in their work sparked the revival of research into
statistical learning for MT that occurred from late 1990s onward.
Newer « phrase-based » approach still partially relies on IBM
models.
• The IBM approach used Bayes’s Theorem to define the
« Fundamental Equation » of MT (Brown et al. 1993)
SMT History: IBM Models
Fundamental Equation of MT
The best-fit translation of a source-language (French) sentence S
into a target-language (English) sentence T is:
T̂ = argmax_T [ P(T) * P(S|T) ]
(argmax_T = search task; P(T) = language model; P(S|T) = word translation model)
Job of language model: ensure well-formed target-language T
Job of translation model: ensure T could have generated S
Search task: find T maximizing product P(T)*P(S|T)
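As a toy illustration with made-up probabilities, the search over an explicit candidate list looks like this; a real decoder searches a huge implicit space rather than a short list:

```python
# Toy illustration of the Fundamental Equation: pick the candidate T maximizing
# P(T) * P(S|T). The probabilities below are invented for illustration only.
candidates = {
    "but where are the snows of yesteryear ?": (1e-9, 2e-4),   # (P(T), P(S|T))
    "however , where are yesterday's snows ?": (4e-9, 3e-5),
    "snows the where but are of yesteryear ?": (1e-14, 2e-4),  # bad word order
}

best_T = max(candidates, key=lambda T: candidates[T][0] * candidates[T][1])
print(best_T)   # the language model P(T) vetoes the ill-ordered hypothesis
```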
SMT History: IBM Models
• The IBM researchers defined five statistical translation models
(numbered in order of complexity)
• Each defines a mechanism for generation of text in one language
(e.g., French or foreign = F) from another (e.g., English = E)
• Most general many-to-many case is not covered by IBM models; in this forbidden case, a group of E words generates a group of F words, e.g.:
« The poor don't have any money » ↔ « Les pauvres sont démunis »
SMT History: IBM Models
• The IBM models only allow one-to-many generation, e.g.:
« And the program has been implemented » → « Le programme a été mis en application »
(And → Ø, the → Le, program → programme, has → a, been → été, implemented → mis en application)
• IBM models 1 & 2 – all lengths for F sentence equally likely
• Model 1 is « bag of words » - word order in F & E doesn’t matter
• In model 2, chance that an E word generates given F word(s)
depends on position
• IBM models 3, 4, & 5 are fertility-based
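For concreteness, below is a minimal sketch of EM training for IBM model 1 on a toy corpus; only the word translation probabilities t(f|e) are estimated, and the NULL word and the length model P(L→M) are omitted for brevity.

```python
# Minimal sketch of EM training for IBM model 1 (toy corpus, no NULL word,
# no length model).
from collections import defaultdict

corpus = [("the house".split(), "la maison".split()),
          ("the book".split(), "le livre".split()),
          ("a book".split(), "un livre".split())]

f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))       # uniform initialization of t(f|e)

for iteration in range(10):
    count = defaultdict(float)                    # expected counts c(f, e)
    total = defaultdict(float)                    # expected counts c(e)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)        # normalize over possible generators
            for e in es:
                p = t[(f, e)] / z                 # posterior that e generated f
                count[(f, e)] += p
                total[e] += p
    for (f, e), c in count.items():               # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3), round(t[("livre", "book")], 3))
```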
SMT History: IBM Models
[Figure: IBM model 1 (« bag of words ») – the F-sentence length M is chosen with probability P(L→M); each French word f1 … fM is then generated from an English word e1 … eL drawn with uniform probability.
IBM model 2 (« position-dependent bag of words ») – same, but the English word generating each French position is drawn with a position-dependent probability P(i→j).]
SMT History: IBM Models
Parameters: φ(ei) = fertility of ei = prob. that ei will produce 0, 1, 2 … words in F; t(f|ei) = probability that ei can generate f; Π(j | i, k) = distortion prob. = prob. that the kth word generated by ei ends up in position j of F.
[Figure: IBM model 3 – each English word ei (and the null word Ø) produces φ(ei) French words via t(f|ei); the distortion model Π then assigns the generated words to positions f1 … fM.
IBM model 4 – same generative story with a different distortion model; NOTE: phrases can be broken up, but with lower prob. than in model 3.]
IBM model 5: cleaned-up version of model 4 (e.g., two F words can't be given the same position).
Phrase-based SMT
Four key ideas
• phrase-based models (Och04, Koehn03, Marcu02)
• dynamic programming search algorithms (Koehn04)
• loglinear model combination (Och02)
• error-driven learning (Och03)
Phrase-based SMT
Example: « cul de sac »
Phrase-based approach introduced around 1998 by
Franz Josef Och & others (Ney, Wong, Marcu):
many-words-to-many-words (improvement on IBM one-to-many)
word-based translation = « ass of bag » (N. Am), « arse of bag » (British)
phrase-based translation = « dead end » (N. Am.), « blind alley » (British)
This knowledge is stored in a phrase table: a collection of conditional probabilities of the form P(S|T) = backward phrase table or P(T|S) = forward phrase table. Recall Bayes:
T̂ = argmax_T [P(T)*P(S|T)] → backward table essential, forward table used for heuristics. Tables for French→English:
backward: P(S|T)
p(sac|bag) = 0.9
p(sacoche|bag) = 0.1
…
p(cul de sac|dead end) = 0.7
p(impasse|dead end) = 0.3
…
forward: P(T|S)
p(bag|sac) = 0.5
p(hand bag|sac) = 0.2
…
p(cul|ass) = 0.5
p(dead end|cul de sac) = 0.85
…
Phrase-based SMT
Overall Phrase Pair Extraction Algorithm
1. Run a sentence aligner on a parallel bilingual corpus (won’t go
over this)
2. Run word aligner (e.g., one based on IBM models) on each
aligned sentence pair – see next slide.
3. From each aligned sentence pair, extract all phrase pairs with
no external links - see two slides ahead.
Phrase-based SMT
Symmetrized Word Alignment using IBM Models
Alignments produced by IBM models are asymmetrical: source words have at
most one connection, but target words may have many connections.
To improve quality, use symmetrization heuristic (Och00):
1. Perform two separate alignments, one in each different translation direction.
2. Take intersection of links as starting point.
3. Add neighbouring links from union until all words are covered.
[Figure: two directional word alignments of « I want to go home » ↔ « Je veux aller chez moi » (one per translation direction), combined into a single symmetrized alignment.]
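A simplified sketch of this symmetrization heuristic is given below; the exact growing rules vary between implementations, and the two directional alignments for the example sentence are hypothetical.

```python
# Simplified symmetrization sketch: start from the intersection of the two
# directional alignments, then repeatedly add links from the union that are
# adjacent to an existing link and cover a so-far-uncovered word.
def symmetrize(src_to_tgt, tgt_to_src):
    """Both arguments are sets of (src_pos, tgt_pos) links."""
    union = src_to_tgt | tgt_to_src
    links = set(src_to_tgt & tgt_to_src)
    added = True
    while added:
        added = False
        covered_s = {i for i, _ in links}
        covered_t = {j for _, j in links}
        for (i, j) in sorted(union - links):
            neighbours = {(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)}
            if neighbours & links and (i not in covered_s or j not in covered_t):
                links.add((i, j))
                added = True
                break
    return links

# "I want to go home" / "Je veux aller chez moi": hypothetical directional alignments
s2t = {(0, 0), (1, 1), (2, 2), (3, 2), (4, 3)}
t2s = {(0, 0), (1, 1), (3, 2), (4, 3), (4, 4)}
print(sorted(symmetrize(s2t, t2s)))
```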
Phrase-based SMT
« Diag-And » phrase extraction
Input: aligned sentence pair – « Je l' ai vu à la télévision » ↔ « I saw him on television »
Output: set of consistent phrase pairs
Extract all phrase pairs with no external links, for example:
Good pairs:
(Je, I) (Je l’ ai vu, I saw him) (ai vu, saw) (l’ ai vu à la, saw him on)
Bad pairs:
(Je l’ ai vu, I saw) (l’ ai vu à, saw him on) (la télévision, television)
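A minimal sketch of this consistency check (keep a phrase pair only if no alignment link crosses its boundary) is shown below; the word alignment for the example sentence pair is hypothetical, and handling of unaligned words is simplified.

```python
# Minimal sketch of consistent phrase-pair extraction from a word-aligned
# sentence pair: a (source span, target span) pair is kept if no alignment
# link leaves the box.
def extract_phrases(src, tgt, links, max_len=7):
    pairs = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            tgt_pos = [j for (i, j) in links if s1 <= i <= s2]
            if not tgt_pos:
                continue
            t1, t2 = min(tgt_pos), max(tgt_pos)
            # consistency: every link touching the target span must stay
            # inside the source span
            if all(s1 <= i <= s2 for (i, j) in links if t1 <= j <= t2):
                pairs.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return pairs

src = "Je l' ai vu à la télévision".split()
tgt = "I saw him on television".split()
# hypothetical word alignment for the example on this slide
links = {(0, 0), (1, 2), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4)}
for pair in extract_phrases(src, tgt, links):
    print(pair)     # includes (Je, I), (ai vu, saw), (Je l' ai vu, I saw him), ...
```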
Phrase-Based Search
Generative process:
1. Split source sentence into “phrases” (N-grams).
2. Translate each source phrase (one-to-one).
3. Permute target phrases to get final translation.
much simpler and more intuitive than the IBM process,
but the price of this is no provision for gaps, e.g., ne VERB pas
[Figure: the three steps applied to « Je l' ai vu à la télévision »: (1) split into phrases « Je | l' | ai vu | à la télévision », (2) translate each source phrase one-to-one (« I | him | saw | on television »), (3) permute the target phrases to obtain « I saw him on television ».]
*** NOTE: XRCE's Matrax does handle gaps
Phrase-Based Search
Order: Target hypotheses grow left → right, from source segments consumed in any order.
[Figure: decoding example over source s1 s2 s3 s4 s5 s6 s7 s8 s9. A segmentation step picks an uncovered source segment (e.g., s2 s3 first, or s3 s4 first); the backward phrase table P(S|T) supplies candidate translations (e.g., p(s2 s3 | t8), p(s2 s3 | t5 t3), p(s3 s4 | t4 t9), …), which are appended to the growing target hypotheses (« t8 | … », « t5 t3 | … », « t4 t9 | … »); further segments are then consumed (e.g., s5 s6 s7) and each hypothesis is extended (« t8 | t6 t2 | … »).
Phrase table: 1. suggests possible segments, 2. supplies phrase translation scores. Language model P(T): scores the growing target hypotheses left → right.]
Loglinear Model Combination
Previous slides show basic system that ranks hypotheses by
P(S|T)*P(T). Now let’s introduce an alignment/reordering variable
A (aligns T & S phrases). We want
T̂ = argmax_T P(T|S) ≈ argmax_{T,A} P(T, A|S)
  = argmax_{T,A} f1(T,A,S)^λ1 * f2(T,A,S)^λ2 * … * fM(T,A,S)^λM
  = argmax_{T,A} exp(∑i λi log fi(T,A,S)).
The fi now typically include not only functions related to P(S|T)
and language model P(T), but also to A « distortion », P(T|S),
length(T), etc. The λi serve as reliability weights.
This change in score computation doesn’t fundamentally change the
search algorithm.
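A small sketch of the resulting scoring rule, with made-up feature values and weights:

```python
# Loglinear scoring sketch: each hypothesis (with its alignment) gets
# exp(sum_i lambda_i * log f_i(T,A,S)), i.e. a weighted product of feature
# functions. Feature values and weights below are invented for illustration.
import math

def loglinear_score(features, weights):
    """features, weights: dicts mapping feature name -> value / lambda."""
    return math.exp(sum(weights[name] * math.log(value)
                        for name, value in features.items()))

weights = {"p_s_given_t": 1.0, "p_t": 0.8, "p_t_given_s": 0.5, "distortion": 0.3}
hypotheses = {
    "but where are the snows of yesteryear ?":
        {"p_s_given_t": 2e-4, "p_t": 1e-9, "p_t_given_s": 3e-4, "distortion": 0.9},
    "however , where are yesterday's snows ?":
        {"p_s_given_t": 3e-5, "p_t": 4e-9, "p_t_given_s": 1e-4, "distortion": 0.5},
}
best = max(hypotheses, key=lambda T: loglinear_score(hypotheses[T], weights))
print(best)   # the lambda_i act as reliability weights on the feature functions
```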
Loglinear Model Combination
Advantages
Very flexible! Anyone can devise dozens of features.
• E.g., if lots of mismatched brackets in output, include feature function
that outputs +1 if no mismatched brackets, -1 if have mismatched
brackets.
• So lots of new features being tried in somewhat haphazard way.
• But systems steadily improving – outputs from NIST 2006 look much
better than those from NIST 2002. SMT not good enough to replace
human translators, but good enough for, e.g., most Web browsing.
Using 1000 machines and massive quantities of data, Google got 45.4
BLEU for Arabic to English, 35.0 for Chinese to English – very high
scores!
Loglinear Model Combination
Typical Loglinear Components for SMT Decoding
• Joint counts C(S,T) from phrase extraction yield estimates P(S|T) stored in
“backward” phrase table and estimates P(T|S) stored in “forward” phrase
table. These are typically relative frequency estimates (but we’ve looked at
smoothed variants).
• Distortion model D(T,A,S) assigns score to amount of phrase reordering
incurred in going from S to hypothesis T. Can be based purely on
displacement, or be lexicalized (identity of words in S & T is important).
• Length model L(T,S) scores probability that hypothesis of length |T|
generated from source of length |S|.
• Language model P(T) gives probability of word sequence T in target
language – see next few slides.
NOTE: these are just for decoding – you can use lots more components for
N-best/lattice reordering!
Target Language Model P(T)
The Stupidest Thing Noam Chomsky Ever Said
« It must be recognized that the notion of a ‘probability of a sentence’ is
an entirely useless one, under any interpretation of this term ».
Chomsky, 1969.
Target Language Model P(T)
• Language model helps generate fluent output by
1. assigning higher probability to correct word order – e.g.,
PLM(the house is small) >> PLM(small the is house)
2. assigning higher probability to correct word choices – e.g.,
PLM(i am going home) >> PLM(I am going house)
• Almost everyone in both SMT and ASR (automatic speech
recognition) communities uses N-gram language models. Start with
P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|w1,…,wi-1)*…*P(wm|w1,…,wm-1),
then limit window to N words. E.g., for N=3, trigram LM:
P(W) = P(w1)*P(w2|w1)*P(w3|w1,w2)*…*P(wi|wi-2,wi-1)*…*P(wm|wm-2,wm-1).
Target Language Model P(T)
• Estimation is done by relative frequency on large corpus :
P(wi|wi-2,wi-1) ≈ f(wi|wi-2,wi-1) = C(wi-2,wi-1,wi)/Σw C(wi-2,wi-1,w).
E.g., in Europarl corpus, see 225 trigrams starting « the red … »:
C(the red cross)=123, C(the red tape)=31, C(the red army)=9,
C(the red card)=7, C(the red ,)=5 (and 50 other trigrams).
So estimate P(cross | the red) = 123/225 = 0.547 .
• But need to reserve probability mass for unseen events - maybe never
saw « the red planet » in Europarl, but don’t want to have estimate
P(planet | the red) = 0. Also, want estimates whose variance isn’t too
high. Smoothing techniques are used to solve both problems. E.g.,
could linearly smooth trigrams with bigrams & unigrams:
P(wi|wi-2,wi-1) ≈ λ*f(wi|wi-2,wi-1) + μ*f(wi|wi-1) + (1-λ-μ)*f(wi); 0 < λ, μ < 1.
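A minimal sketch of the relative-frequency trigram estimate and the linear interpolation above; λ and μ are set by hand here, whereas in practice they would be tuned on held-out data.

```python
# Relative-frequency trigram estimate with linear interpolation smoothing.
from collections import Counter

def train_counts(corpus):
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in corpus:
        words = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
    return uni, bi, tri

def p_interp(w, w1, w2, uni, bi, tri, lam=0.6, mu=0.3):
    """P(w | w2 w1) ≈ lam*f(w|w2,w1) + mu*f(w|w1) + (1-lam-mu)*f(w)."""
    n = sum(uni.values())
    f_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    f_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    f_uni = uni[w] / n
    return lam * f_tri + mu * f_bi + (1 - lam - mu) * f_uni

corpus = [s.split() for s in ["the red cross helps", "the red tape grows",
                              "the red cross acts", "mars is a planet"]]
uni, bi, tri = train_counts(corpus)
print(p_interp("cross", "red", "the", uni, bi, tri))   # mostly from f(cross | the red) = 2/3
print(p_interp("planet", "red", "the", uni, bi, tri))  # unseen trigram, still > 0 via unigram term
```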
Target Language Model P(T)
Measuring Language Model Quality
• Perplexity: metric that measures predictive power of an LM on new
data as an average branching factor. E.g., model that says «any digit 0,
…, 9 has equal probability of occurrence » will yield perplexity of
10.0 on digit sequence generated randomly from these 10 digits.
• Perplexity of LM measured on corpus W = (w1 … wN) is
Perp_LM(W) = (Π_i P(wi|LM))^(-1/N) = 1/(average per-word prob.)
The better the LM is as a model for W, the less « surprised » it is by the words of W → higher estimated prob. → lower entropy.
Typical perplexities for well-trained English trigram LMs with lexica
of about 25K words for various dictation domains:
Perp(radiology)=20, Perp(emergency medicine)=60,
Perp(journalism)=105, Perp(general English)=247 .
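A tiny sketch of the perplexity computation, reproducing the uniform-digit example from this slide:

```python
# Perplexity = inverse geometric mean of the per-word probabilities assigned
# by the LM to the test data.
import math

def perplexity(test_words, lm_prob):
    """lm_prob(w, history) -> P(w | history); returns perplexity on test_words."""
    log_sum = 0.0
    for i, w in enumerate(test_words):
        log_sum += math.log(lm_prob(w, test_words[:i]))
    return math.exp(-log_sum / len(test_words))

digit_lm = lambda w, hist: 0.1                   # « any digit 0 … 9 is equally likely »
print(perplexity(list("8350194462"), digit_lm))  # ≈ 10.0, regardless of the digit sequence
```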
Target Language Model P(T)
• « A Bit of Progress in Language Modeling » (Goodman01) is good
summary of state of the art in N-gram language modeling.
• Consistently superior method: Kneser-Ney.
Intuition: if « Francisco » & « eggplant » are each seen 10^3 times in our corpus of 10^6 words, and neither « eggplant Francisco » nor « eggplant stew » is seen, which should be higher, P(Francisco|eggplant) or P(stew|eggplant)?
Interpolation answer: P(wi|wi-1) ≈ λ*f(wi|wi-1) + (1-λ)*f(wi).
So P(Francisco|eggplant) ≈ λ*0 + (1-λ)*10^-3 = P(stew|eggplant).
Kneser-Ney answer: no, « Francisco » only occurs after « San », but the 1,000 occurrences of « stew » are preceded by 100 different words. So when (wi-1 wi) has never been seen before, wi = « stew » is more probable than wi = « Francisco » → P(stew|eggplant) >> P(Francisco|eggplant).
Target Language Model P(T)
• Kneser-Ney formula (for bigrams – easily extended to N-grams):
P_KN(wi | wi-1) = max[C(wi-1 wi) - D, 0]/C(wi-1) +
α(wi-1) * #{v | C(v wi) > 0} / Σ_w #{v | C(v w) > 0} ,
where D is a discount factor < 1, α(wi-1) is a normalization constant,
#{v | C(v wi) > 0} is the number of different words that precede wi in
the training corpus, and Σ_w #{v | C(v w) > 0} is the number of
different bigrams in the training corpus.
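A direct sketch of this bigram formula on a toy corpus that mimics the « Francisco » vs. « stew » intuition; the corpus and counts are made up.

```python
# Bigram Kneser-Ney sketch: absolute discounting of the bigram count plus a
# backoff term proportional to the number of *different* words that precede
# w_i (its continuation count), not to its raw frequency.
from collections import Counter

def kneser_ney(corpus, D=0.75):
    bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
    histories = Counter(w for sent in corpus for w in sent[:-1])   # C(w_prev)
    continuations = Counter(w2 for (w1, w2) in bigrams)            # #{v : C(v w) > 0}
    n_bigram_types = len(bigrams)                                  # Σ_w #{v : C(v w) > 0}

    def p_kn(w, w_prev):
        discounted = max(bigrams[(w_prev, w)] - D, 0) / histories[w_prev]
        # alpha(w_prev): mass removed by discounting, redistributed to the backoff term
        alpha = D * len([1 for (v, x) in bigrams if v == w_prev]) / histories[w_prev]
        return discounted + alpha * continuations[w] / n_bigram_types
    return p_kn

corpus = [["san", "francisco"]] * 1000 + [["beef", "stew"], ["lamb", "stew"],
                                          ["fish", "stew"], ["eggplant", "dip"]]
p_kn = kneser_ney(corpus)
print(p_kn("stew", "eggplant"), ">", p_kn("francisco", "eggplant"))   # ≈ 0.45 > ≈ 0.15
```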
Flaws of Phrase-based,
Loglinear Systems
• Loglinear feature function combination is too flexible! Makes it
easy not to think about theoretical properties of models.
• The IBM models were true models: given arbitrary source sentence S
and target sentence T, could estimate non-zero P(T|S). Phrase-based
“models” are not models: in general, for T which is a good
translation of S, they give P(T|S) = 0. They don’t guarantee existence
of an alignment between T and S. Thus, the only translations T’ to
which a phrase-based system is guaranteed to assign P(T’|S) > 0 are
T’ output by same system.
• This has practical consequences: in general, a phrase-based MT
system can’t be used for analyzing pre-existing translations. This rules
out many useful forms of assistance to human translators - e.g.,
spotting potential errors in translations based on regions of low
P(T|S).
PORTAGE: A Typical SMT
System
1. Sentence-align a big bilingual corpus
2. On each sentence pair, use IBM models to align words
3. Build phrase tables from word alignments via “diag-and” or
similar heuristic (Koehn03). Backwards phrase table gives
P(S|T) (& is implicit segmentation model).
4. Build language model (LM) for target language: estimates P(T) ,
based on n-grams in T
5. P(S|T) and P(T) are sufficient for decoding, but one often adds
other loglinear feature functions such as a distortion penalty
6. Use (Och03) method to find good weights λi for loglinear
features
7. Optionally, include reordering step: i.e., decoder outputs many
hypotheses (via N-best list or lattice) which are rescored by
larger set of feature functions
PORTAGE: A Typical SMT
System
Core Engine
[Figure: the Canoe decoder uses a « small » set of information sources: at least one language model (LM), at least one phrase translation model (TM), at least one distortion model (DM), and a number-of-words model (NM), each weighted (wLM*LM, wTM*TM, …, wNM*NM). Given the source sentence « mais où sont les neiges d' antan ? », it produces N-best hypotheses, e.g.
H1: hey , where did the old snow go ? P = 0.41; H2: yet where are yesterday's snows ? P = 0.33; H3: but where are the snows of yesteryear ? P = 0.18; …
A « large » set of information sources (any number of additional feature functions A1, A2, A3, … – for the rescorer only), with its own weights (kLM*LM, kTM*TM, …, kA3*A3), then rescores the N-best list:
H1: but where are the snows of yesteryear ? P = 0.53; H2: however , where are yesterday's snows ? P = 0.20; …]
Training Core Components of PORTAGE
[Figure: training pipeline. A raw parallel corpus (src-lang text, tgt-lang text) is preprocessed (src & tgt preprocessing) and sentence-aligned into a clean, aligned parallel corpus. IBM training (models 1 & 2) produces word alignments; phrase pair extraction then builds the phrase translation model (PT); a language model builder, run on the target half plus additional monolingual target-language corpora, builds the language model (LM). The decoder weight optimizer tunes the « small » set (LM, PT, other small-set models) on a dev1 corpus to give small-set weights w1, …, wK; the rescorer weight optimizer tunes the extra models for the « large » set (model3, …, modelK, modelK+1, …, modelM) on a dev2 corpus to give large-set weights w1', …, wM'.]
Canoe Optimization of Weights (COW)
Purpose: find weights [w1, …, ws] on « small » set of information sources (N around 100)
[Figure: COW loop. Dev corpus for COW: D source-language sentences (S1: hé quoi ? / S2: charmante élise , vous devenez mélancolique . / … / SD: la fin .) with D target-language reference translations (T1: what's this ? / T2: charming élise , you're becoming melancholy . / … / TD: the end .), plus initial weights [w1, …, ws] on the « small » set of information sources.
Each iteration: the Canoe decoder, run with the current weights, produces an N-best list for each dev sentence (H1(S1), …, HN(S1); …; H1(SD), …, HN(SD), e.g. H1(S1): what's up ? … HN(SD): all done .); on the 2nd & subsequent iterations these are merged with the old hypotheses into an expanded list (union of old & new hypotheses). Rescore_train then searches for new weights that maximize the BLEU score of the top hypotheses over the expanded list, running Powell's algorithm from K random weight vectors W1 = [w1^1, …, ws^1], …, WK = [w1^K, …, ws^K] and keeping the best resulting vector Ŵ. The new weights are passed back to Canoe and the loop repeats.]
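Below is a heavily simplified sketch of this loop. Powell's algorithm from K random starting points is replaced by plain random search, and decode_nbest / corpus_score are caller-supplied stand-ins rather than PORTAGE components; the toy usage at the end is purely illustrative.

```python
# Simplified COW-style weight tuning: decode the dev set with the current
# weights, merge the new N-best lists with those from earlier iterations, then
# search for weights that maximize the corpus score of the resulting 1-best.
import random

def loglinear(features, weights):
    return sum(weights[k] * v for k, v in features.items())    # log-domain score

def tune_weights(dev_src, dev_refs, decode_nbest, corpus_score,
                 feature_names, iterations=5, restarts=20):
    weights = {k: 1.0 for k in feature_names}
    pools = [[] for _ in dev_src]                    # accumulated hypotheses per sentence
    for _ in range(iterations):
        for pool, src in zip(pools, dev_src):
            pool.extend(decode_nbest(src, weights))  # list of (hypothesis, feature dict)
        best = (None, None)
        for _ in range(restarts):                    # stand-in for Powell from K random starts
            cand = {k: random.uniform(-1, 1) for k in feature_names}
            top = [max(pool, key=lambda h: loglinear(h[1], cand))[0] for pool in pools]
            score = corpus_score(top, dev_refs)
            if best[0] is None or score > best[0]:
                best = (score, cand)
        weights = best[1]
    return weights

# toy usage with stand-in components (hypothetical hypotheses and features)
toy_nbest = {"s1": [("good translation", {"lm": -1.0, "tm": -2.0}),
                    ("bad translation", {"lm": -0.5, "tm": -5.0})]}
decode = lambda src, w: toy_nbest[src]
exact_match = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
print(tune_weights(["s1"], ["good translation"], decode, exact_match, ["lm", "tm"]))
```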
Rescoring = Finding Weights on « Large » Info. Set for Rescorer
(N around 1000)
[Figure: same dev corpus (D source-language sentences and D reference translations) as for COW, but now for the « large » set of information sources I1, …, IS, IS+1, …, IL. The weights on the « small » set (w1*I1, …, wS*IS) are fixed by the previous COW step; the Canoe decoder produces a single list of D N-best hypothesis lists (N around 1000). Rescore_train, using Powell's algorithm from K random weight vectors and BLEU scoring of the top hypotheses against the reference translations, then finds the final « large »-set weights [w1f, …, wLf] over all L feature functions.]
Tutorial Plan
B. Details & research topics
Named entities
Large-scale discriminative training (George Foster)
Decoding for SMT (prepared by Nicola Ueffing)
Hierarchical models (George Foster)
System combination
Named entity recognition &
transliteration
Chinese Example
« Secretary-General Wong appeared with Larry Ellison, Chief Executive Officer
of Oracle Corporation, at a press conference to announce Oracle’s
investment of $100 million dollars in a new research centre in Szechuan
Province ».
Personal names: “Wong”, “Larry Ellison”.
Titles: “Secretary-General”, “Chief Executive Officer”.
Organization name: “Oracle Corporation”.
Place name: “Szechuan Province”.
Recognition problem: detect these entities in a continuous stream of ideograms.
Transliteration problem: when ideograms are used phonetically (esp. for non-Chinese names like "Larry Ellison"), become aware of that, & map them onto Latin characters.
Named entity recognition &
transliteration
Made-up Chinese Transliteration Example
How to translate “唐纳德·拉姆斯菲尔德”?
唐 [táng] (surname) - Tang Dynasty; 纳(F納) [nà] receive, accept, enjoy, pay, sew;
德 [dé] virtue
拉 [lā] pull, drag, haul; 姆 [mǔ] nurse; 斯 [sī] (thus; now used mostly for sound:)
菲 [fěi] 菲薄 humble; 尔(F爾) [ěr] (archaic:) you; 德 [dé] virtue
→ "After receiving virtue from the Tang Dynasty, you thus pulled the humble nurse away from virtue" (????). No –
« tang na de la mu si fei de » = DONALD RUMSFELD.
Actual Chinese→English example generated by PORTAGE:
"Outgoing president Iliescu has also congratulated Basescu." →
"Outgoing president of Iraq, has also been made to the road to the public."
Named entity recognition &
transliteration
Other Examples
Arabic→English:
Muammar Ghadafy = Moammar Khaddafi = Muamar Qadafy = …;
Azeddine = Elzedine = Alsuddin = Ahzudin = … (depending on region, pronounced
differently & thus transliterated into Latin alphabet differently)
English→French (Google Translate Jan. 24, 2007):
"The Englishman John Snow thought cholera was transmitted by small, living organisms."
→
“Le choléra de pensée de neige de John d'Anglais a été transmis par la petite, organique
matière.”
System Combination
Introduction
• Different systems make different errors – why not combine information? This worked well for ASR …
• But, because of reordering, synonyms, etc., system combination is not as easy for MT!
• RWTH (Aachen) is an SMT powerhouse – has recently been working on parallel system combination (Evgeny Matusov).
• NRC has been working on serial system combination.
• Both teams are now getting good results.
System Combination
Parallel System Combination (RWTH Aachen)
• Hypotheses from different systems aligned; some word reordering allowed; use of synonyms
• Generate confusion network → choices at each position scored with system weights and word confidence scores
• N-best consensus translations are generated from the confusion network & rescored with various information sources
• A year ago, results unimpressive. Since then, added new information sources (e.g., LMs trained on N-best lists from contributing systems) that encourage preservation of original phrases. Nice preliminary Arabic results: improvement of +2-3 BLEU points over the best individual system in the combination.
System Combination
Example of RWTH Parallel Combination
Ref: Chinese president directs unprecedented criticism at leaders of Hong Kong.
Best System: Chinese president slams unprecedented leaders to Hong Kong.
System Comb.: Chinese president sends unprecedented criticism of the leaders of Hong Kong.
System Combination
Serial System Combination (NRC)
• Use SMT to correct mistakes made by another method (e.g., a rule-based one)
[Figure: source text → MT1 → initial target text → SMT → final target text]
Training Procedure
• Use MT1 to produce an initial target translation of the source half of a parallel human-translated corpus, thus giving a corpus of MT1 target output in parallel with good target versions of the same sentences; use this parallel corpus of (MT1 target || human target) sentences to train the SMT.
• Even better, if you can get humans to post-edit the MT1 output, you have MT1 target in parallel with corrected target as the SMT training corpus.
System Combination
Discussion & Future Work
• Parallel combination probably best for similar systems of good quality, serial combination for systems that are very different
• Future work for serial combination: allow SMT both direct & indirect (via MT1) access to the source text. Could do this using, e.g.: rescoring, parallel phrasetables, parallel LMs, parallel decoding … (etc.)
[Figure: source text → MT1 → initial target text → SMT → final target text]
References (1)
Best overall reference
Philipp Koehn, « Statistical Machine Translation », University of
Edinburgh (textbook to appear 2007 or 2008, Cambridge University
Press).
Papers (NOTE: short summary of key papers available from Kuhn/Foster)
Brown93 Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-312, June 1993.
Chomsky69 Noam Chomsky. Quine’s Empirical Assertions. In Words and Objections –
Essays on the Work of W.V. Quine (ed. D. Davidson and J. Hintikka). Dordrecht,
Netherlands, 1969.
Foster06 George Foster, Roland Kuhn, and Howard Johnson. Phrasetable Smoothing for
Statistical Machine Translation. EMNLP 2006, Sydney, Australia, July 22-23, 2006.
Germann01 Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada.
Fast decoding and optimal decoding for machine translation. In Proceedings of the 39th
Annual Meeting of the Association for Computational Linguistics (ACL), Toulouse, July 2001.
References (2)
Goodman01 Joshua Goodman. A Bit of Progress in Language Modeling (extended version).
Microsoft Research Technical Report, Aug. 2001. Downloadable from
research.microsoft.com/~joshuago/publications.htm
Jones05 Douglas Jones, Edward Gibson, et al. Measuring Human Readability of Machine
Generated Text: Studies in Speech Recognition and Machine Translation. In Proceedings of
the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia,
PA, USA, March 2005 (Special Session on Human Language Technology: Applications and
Challenge of Speech Processing).
Knight99 Kevin Knight. Decoding complexity in word-replacement translation models.
Computational Linguistics, Squibs and Discussion, 25(4), 1999.
Koehn04 Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical
machine translation models. In Proceedings of the 6th Conference of the Association for
Machine Translation in the Americas, Georgetown University, Washington D.C., October
2004. Springer-Verlag.
KoehnDec03 Philipp Koehn. PHARAOH - a Beam Search Decoder for Phrase-Based
Statistical Machine Translation Models (User Manual and Description). USC Information
Sciences Institute, Dec. 2003.
References (3)
KoehnMay03 Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Eduard Hovy, editor, Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association for
Computational Linguistics (HLT/NAACL), pp. 127-133, Edmonton, Alberta, Canada,
May 2003.
Marcu02 Daniel Marcu and William Wong. A phrase-based, joint probability model
for statistical machine translation. In Proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA,
2002.
OchJHU04 Franz Josef Och, Daniel Gildea, et al. Final Report of the Johns Hopkins
2003 Summer Workshop on Syntax for Statistical Machine Translation (revised
version). http://www.clsp.jhu.edu/ws03/groups/translate (JHU-syntax-for-SMT.pdf),
Feb. 2004.
Och04 Franz Och and Hermann Ney. The alignment template approach to statistical
machine translation. Computational Linguistics, V. 30, pp. 417-449, 2004.
Och03 Franz Josef Och. Minimum error rate training for statistical machine
translation. In Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics (ACL), Sapporo, July 2003.
References (4)
Och02 Franz Josef Och and Hermann Ney. Discriminative training and maximum
entropy models for statistical machine translation. In Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July
2002.
Och01 Franz Josef Och, Nicola Ueffing, and Hermann Ney. An Efficient A* Search
Algorithm for Statistical Machine Translation. In Proc. Data-Driven Machine
Translation Workshop, July 2001.
Och00 Franz Josef Och and Hermann Ney. A Comparison of Alignment Models for
Statistical Machine Translation. Int. Conf. on Computational Linguistics (COLING),
Saarbrucken, Germany, August 2000.
Papineni01 Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU:
A method for automatic evaluation of Machine Translation. Technical Report
RC22176, IBM, September 2001.
Ueffing02 Nicola Ueffing, Franz Josef Och, and Hermann Ney. Generation of Word
Graphs in Statistical Machine Translation. Empirical Methods in Natural Language
Processing, July 2002.