Discriminative and Neural Translation Models July 2, 2015

Noisy Channel MT

[Diagram: a source model p(e) generates English; the channel model p(g | e) turns it into German; a decoder recovers the English sentence.]

e* = argmax_e p(e | g)
   = argmax_e p(g | e) p(e) / p(g)
   = argmax_e p(g | e) p(e)
And Beyond…

e* = argmax_e p(e | g)
   = argmax_e p(g | e) p(e) / p(g)
   = argmax_e p(g | e) p(e)
   = argmax_e log p(g | e) + log p(e)
   = argmax_e w^T h(g, e),   where w = (1, 1)^T and h(g, e) = (log p(g | e), log p(e))

Does this look familiar?
MT Linear Model

[Scatter plot: candidate translations plotted by -log p(g | e) and -log p(e), scored along a weight vector w.]

Improvement 1: change w to find better translations
Improvement 2: add dimensions to make points separable
MT Linear Model

e* = argmax_e w^T h(g, e)   (a scoring sketch follows this slide)

• Improve the modeling capacity of the noisy channel in two ways
  • Reorient the weight vector (parameter optimization)
  • Add new dimensions (new features)
• Questions
  • What features? [ h(g, e) ]
  • How do we set the weights? [ w ]
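Here is a minimal sketch of linear-model scoring for translation hypotheses; the hypothesis strings and feature values are made up for illustration, and only the rule e* = argmax_e w^T h(g, e) is the point.

```python
import numpy as np

# Score a hypothesis with a linear model: w^T h(g, e).
def score(w, h):
    return float(np.dot(w, h))

# The noisy channel is the special case with two features and unit weights.
w = np.array([1.0, 1.0])                       # weights for [log p(g|e), log p(e)]
hypotheses = {
    "I drank a beer": np.array([-2.1, -3.4]),  # hypothetical feature values
    "I drank a bear": np.array([-2.0, -7.9]),
}

best = max(hypotheses, key=lambda e: score(w, hypotheses[e]))
print(best)  # "I drank a beer"
```

Reorienting w or adding components to h changes which hypothesis wins; that is exactly the two improvements above.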
What Features?

• Two solutions
  • Engineer features
    As a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features
    Let data decide what features should be used to solve the problem.
What are Features?

• A feature function is a function from objects to real values
• A "good" feature function will map inputs that behave similarly (from the task perspective) to similar values
• Example feature functions (see the sketch below):
  • the number of verbs in the translation
  • 1 if the first word is capitalized, 0 otherwise
  • the log language model probability of a sequence
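To make this concrete, here is a minimal sketch of two of the example feature functions above; the function names are made up for illustration.

```python
def first_word_capitalized(e: str) -> float:
    # 1 if the first word is capitalized, 0 otherwise
    return 1.0 if e and e[0].isupper() else 0.0

def length_in_words(e: str) -> float:
    return float(len(e.split()))

def h(e: str) -> list[float]:
    # A feature vector: each component is one real-valued feature of e.
    return [first_word_capitalized(e), length_in_words(e)]

print(h("I drank a beer"))  # [1.0, 4.0]
```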
[Diagram: a word like "horse" is first a one-hot vector v_horse of length |V|; multiplying by a projection matrix gives a dense d-dimensional embedding w_horse = P v_horse, whose dimensions can loosely be read as properties such as "animate", "size", "used for transportation".]
Words as Vectors

[Diagram: word vectors w_horse, w_dog, w_cat, w_train, w_truck, w_car, w_translation plotted in the embedding space; related words (horse, dog, cat vs. train, truck, car) cluster together, and the angle θ between two vectors measures their similarity.]
Features for MT

• For MT we want to represent translations (not words) as vectors
• The space of translations (not just good ones, but bad ones too!) is huge, so designing this vector space is a challenge
• The noisy channel says: "let's represent translations as two-dimensional vectors"
• Until a year ago, MT was pretty close to that…
Features for MT

• Target side features
  • log p(e) [ n-gram language model ]
  • Number of words in hypothesis
  • Non-English character count
• Source + target features (see the sketch after this list)
  • log relative frequency e|f of each rule [ log #(e,f) - log #(f) ]
  • log relative frequency f|e of each rule [ log #(e,f) - log #(e) ]
  • "lexical translation" log probability e|f of each rule [ ≈ log p_model1(e|f) ]
  • "lexical translation" log probability f|e of each rule [ ≈ log p_model1(f|e) ]
• Other features
  • Count of rules/phrases used
  • Reordering pattern probabilities
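A minimal sketch of the relative-frequency rule features, computed from rule counts; the toy counts below are made up for illustration.

```python
import math
from collections import Counter

rule_counts = Counter({("ein Bier", "a beer"): 9, ("ein Bier", "a bear"): 1})
f_counts, e_counts = Counter(), Counter()
for (f, e), c in rule_counts.items():
    f_counts[f] += c
    e_counts[e] += c

def log_rel_freq_e_given_f(f: str, e: str) -> float:
    # log #(e,f) - log #(f)
    return math.log(rule_counts[(f, e)]) - math.log(f_counts[f])

def log_rel_freq_f_given_e(f: str, e: str) -> float:
    # log #(e,f) - log #(e)
    return math.log(rule_counts[(f, e)]) - math.log(e_counts[e])

print(log_rel_freq_e_given_f("ein Bier", "a beer"))  # log(9/10) ≈ -0.105
```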
Computing Features: Locality

• Local features
  Feature values only depend on the rule
• Source-contextual features
  Feature values depend on the rule's target side and the entire source document
• Non-local features
  Feature values depend on the other translation decisions used
  Example: the (n-gram) language model
Why do we care?

• Searching for the best solution with local and source-contextual features is inexpensive
  • Use dynamic programming / shortest path
• Non-local features effectively mean that decisions you make "early" in the search affect how much each later decision will cost
  • NP-hard maximization problem in general
  • You will get to (attempt to) solve this in the lab today
Parameter Learning

Hypothesis Space

[Diagram: the hypothesis space plotted in feature space (h1, h2); one plot shows the hypotheses, a second adds the reference translations.]
Preliminaries

We assume a decoder that computes:

⟨e*, a*⟩ = argmax_{⟨e, a⟩} w^T h(g, e, a)

And K-best lists of the same, that is:

{⟨e*_i, a*_i⟩}_{i=1}^{K} = arg i-th-max_{⟨e, a⟩} w^T h(g, e, a)

Standard algorithms exist for both of these.
Learning Weights

• Try to match the reference translation exactly
  • Conditional random field
    • Maximize the score of the reference translations
    • "Average" over the different ways of producing the same translation (the latent derivation variables)
  • Max-margin
    • Find the weight vector that separates the reference translation from the others by the maximal margin
    • Use the maximal setting of the latent variables

[Diagram: two different derivations of "ich trank ein Bier" → "I drank a beer", differing in how the phrases are segmented and aligned.]
Problems
• These methods give “full credit” when the model
exactly produces the reference and no credit
otherwise
• What is the problem with this?
• There are many ways to translate a sentence
• What if we have multiple reference
translations?
• What about partial credit?
Cost-Sensitive Training

• Assume we have a cost function mapping (ê, E) → [0, 1] that gives a score for how good/bad a translation is (a toy example follows this slide)
• Optimize the weight vector by making reference to this function
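A minimal sketch of such a cost function in [0, 1]. Real systems typically use something like 1 minus sentence-level BLEU; the unigram-precision stand-in below is only for illustration.

```python
def cost(hyp: str, references: list[str]) -> float:
    # 0.0 = perfect, 1.0 = worst; here: 1 - unigram precision against the references.
    hyp_words = hyp.split()
    if not hyp_words:
        return 1.0
    ref_words = {w for ref in references for w in ref.split()}
    matched = sum(1 for w in hyp_words if w in ref_words)
    return 1.0 - matched / len(hyp_words)

print(cost("I drank a bear", ["I drank a beer"]))  # 0.25
```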
K-Best List Example

[Diagram: the 10 hypotheses of a K-best list plotted in feature space (h1, h2) along the weight vector w; each hypothesis falls into a cost band, from 0.0 ≤ cost < 0.2 up to 0.8 ≤ cost < 1.0.]
Training as Classification

• Pairwise Ranking Optimization
  • Reduce the training problem to binary classification with a linear model
• Algorithm (sketched in code below)
  • For i = 1 to N
    • Pick a random pair of hypotheses (A, B) from the K-best list
    • Use the cost function to determine whether A or B is better
    • Create the i-th training instance
  • Train a binary linear classifier
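A minimal sketch of the pairwise sampling step, assuming each K-best entry is a (feature_vector, cost) pair; the perceptron-style loop at the end merely stands in for whatever binary linear classifier you prefer.

```python
import random
import numpy as np

def pro_instances(kbest, num_pairs=5000):
    instances = []
    for _ in range(num_pairs):
        (h_a, cost_a), (h_b, cost_b) = random.sample(kbest, 2)
        if cost_a == cost_b:
            continue
        label = 1.0 if cost_a < cost_b else -1.0   # +1 if A is the better hypothesis
        instances.append((h_a - h_b, label))       # classify the feature difference
    return instances

def train_binary_classifier(instances, epochs=10, lr=0.1):
    w = np.zeros_like(instances[0][0])
    for _ in range(epochs):
        for x, y in instances:
            if y * np.dot(w, x) <= 0:              # pair currently mis-ranked
                w += lr * y * x
    return w                                       # the updated MT weight vector
```

Any off-the-shelf binary classifier (logistic regression, linear SVM) can replace the perceptron loop; the learned separator is reused directly as the MT weight vector.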
[Animation: random pairs of hypotheses are drawn from the K-best list and labeled "Better!" or "Worse!" according to the cost function; the resulting labeled points in the feature-difference space (h1, h2) are then separated by fitting a linear model, which yields an updated weight vector w.]
K-Best List Example

[Diagram: the same K-best list in feature space (h1, h2) with its cost bands, now re-scored along the updated weight vector w.]
Why do this?

[Figure: discriminative training.]

Questions?
What Features?

• Two solutions
  • Engineer features
    As a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features
    Let data decide what features should be used to solve the problem.
    We will use neural networks for this.
Learning Features with Neural Networks

• Neural networks take very "small" features (e.g., one word, one pixel, one slice of a spectrogram) and compose them into representations of larger, more abstract units
  • Compute a representation of a sentence from words
  • Compute a representation of the contents of an image from pixels
Bengio et al. (2003)

p(e) = ∏_{i=1}^{|e|} p(e_i | e_{i-n+1}, …, e_{i-1})

[Diagram: each context word e_{i-1}, e_{i-2}, e_{i-3} is looked up in the embedding matrix C to give its "word embedding" c_{e_{i-1}}, c_{e_{i-2}}, c_{e_{i-3}}; the embeddings are concatenated into x, a hidden layer computes h = tanh(Wx + b), and the output layer V followed by a softmax gives the distribution over the next word e_i.]

p(e_i | e_{i-1}, e_{i-2}, e_{i-3}) = exp(v_{e_i}^T h(e_{i-1}, e_{i-2}, e_{i-3})) / Σ_{e' ∈ V} exp(v_{e'}^T h(e_{i-1}, e_{i-2}, e_{i-3}))

h is a feature function that extracts a vector of features from a trigram of words.
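A minimal sketch of the forward pass drawn above; the sizes and the random parameters are placeholders, not trained values.

```python
import numpy as np

vocab_size, emb_dim, hid_dim = 1000, 32, 64
rng = np.random.default_rng(0)
C = rng.normal(size=(vocab_size, emb_dim))      # word embeddings
W = rng.normal(size=(hid_dim, 3 * emb_dim))     # hidden layer over a trigram context
b = rng.normal(size=hid_dim)
V = rng.normal(size=(vocab_size, hid_dim))      # output layer, rows are v_e

def next_word_distribution(context_ids):
    """p(e_i | e_{i-3}, e_{i-2}, e_{i-1}) over the whole vocabulary."""
    x = np.concatenate([C[j] for j in context_ids])   # concatenated embeddings
    h = np.tanh(W @ x + b)                            # h = tanh(Wx + b)
    logits = V @ h                                    # v_e^T h for every word e
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()                          # softmax

p = next_word_distribution([3, 17, 42])               # three context word ids
print(p.shape, round(p.sum(), 6))                     # (1000,) 1.0
```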
What Features?

• Two solutions
  • Engineer features
    As a language expert, what do I think is likely to be useful for distinguishing good from bad translations?
  • Learn features (induce features)
    Let data decide what features should be used to solve the problem.
    We will use neural networks for this.
Learning Features

• Neural networks are, of course, nonlinear models with particular parametric forms
• But the view that they induce features is important because, while tasks do different things, we often use the same features in many tasks
  • Key opportunity: some tasks have a lot more data to learn features from
  • Multitask learning: learn the parameters for many tasks jointly
• The go-to "auxiliary" task for language is: language modeling
Devlin et al. (2014)

• Turn Bengio et al. (2003) into a translation model
• Conditional model; generate the next English word conditioned on
  • The previous n English words you generated
  • The aligned source word and its m neighbors

p(e | f, a) = ∏_{i=1}^{|e|} p(e_i | e_{i-2}, e_{i-1}, f_{a_i - 1}, f_{a_i}, f_{a_i + 1})

[Diagram: the Bengio-style network with two groups of inputs, the target context (n = 4) embedded with C and the source alignment and its context (m = 4) embedded with D, feeding the tanh hidden layer and the softmax output over e_i.]
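A minimal sketch of this conditioning: the network input concatenates embeddings of the previous target words with embeddings of the aligned source word and its neighbors. Sizes and word ids are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 32
C = rng.normal(size=(1000, emb_dim))   # target-word embeddings
D = rng.normal(size=(1000, emb_dim))   # source-word embeddings

def nnjm_input(prev_target_ids, aligned_source_window_ids):
    # e.g. prev_target_ids = [e_{i-2}, e_{i-1}],
    #      aligned_source_window_ids = [f_{a_i - 1}, f_{a_i}, f_{a_i + 1}]
    parts = [C[j] for j in prev_target_ids] + [D[j] for j in aligned_source_window_ids]
    return np.concatenate(parts)       # x, fed into tanh(Wx + b) as before

x = nnjm_input([3, 17], [5, 6, 7])
print(x.shape)                         # (160,) = 5 words * 32 dims
```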
Neural Translation

• We know how to use neural networks to condition on unbounded history (e.g., RNNs, LSTMs) without engineering features (e.g., n-grams)
• Translation models can be thought of as a "conditional" language model, i.e., a sequence model that conditions on its past + a source sentence

[Diagram: a recurrent network generates "I drank a Coke" word by word while conditioning on the source sentence "ich trank eine Cola".]
Neural MT

• Two big questions:
  • How do we represent the sentence / pieces of the sentence?
  • How do we influence the prediction of the language model component?
Encoding Sentences

["Additive" architecture: the word vectors of "The", "sun", "is", "shining" are summed (The + sun + is + shining = sentence vector).]
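A minimal sketch of the additive encoder: the sentence vector is simply the sum of its word vectors. The toy vocabulary and random embeddings are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"The": 0, "sun": 1, "is": 2, "shining": 3}
E = rng.normal(size=(len(vocab), 8))   # word embedding table

def encode_additive(sentence: str) -> np.ndarray:
    return sum(E[vocab[w]] for w in sentence.split())

print(encode_additive("The sun is shining").shape)   # (8,)
```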
Encoding Sentences
The
sun
is
“Recurrent” architecture
shining
Sutskever et al.
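A minimal sketch of the recurrent idea: an elementary RNN reads the words left to right and its final hidden state serves as the sentence vector. This is a toy, not Sutskever et al.'s exact (LSTM-based) model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"The": 0, "sun": 1, "is": 2, "shining": 3}
E = rng.normal(size=(len(vocab), 8))
W_h = rng.normal(size=(16, 16))
W_x = rng.normal(size=(16, 8))

def encode_recurrent(sentence: str) -> np.ndarray:
    h = np.zeros(16)
    for w in sentence.split():
        h = np.tanh(W_h @ h + W_x @ E[vocab[w]])   # update the hidden state per word
    return h                                        # final state = sentence vector

print(encode_recurrent("The sun is shining").shape)   # (16,)
```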
Encoding Sentences

["Recursive" architecture (Socher et al.): word vectors are composed bottom-up along the parse tree (DT NN VBZ JJ → NP, VP → S) of "The sun is shining".]

Encoding Sentences

["Convolutional" architecture (Blunsom et al.): convolutions over the word vectors of "The sun is shining" build up the sentence representation.]
Decoding Vectors

• High-level decision
  • The input is represented as a single vector, or
  • The representation of the input evolves over time (usually computed via some "attention" mechanism)
• Many low-level decisions
  • Do we just have two inputs to the RNN at each time step?
  • Do we use the input to "gate" the RNN recurrences?

[Figures from Mei, Bansal, and Walter (2015): the decoder at each step receives the previous output word, the current input word, and the input representation.]
Kalchbrenner & Blunsom (2013)

[Figure slides from Kalchbrenner & Blunsom (2013).]
Encoder/Decoder model

[Results: English-German and German-English, WMT 2015.]
Questions?