
Lecture 6 - Seq2Seq and Attention

Seq2Seq (with Attention)
Pavlos Protopapas
Outline
Sequence 2 Sequence
Sequence 2 Sequence with Attention
Sequence-to-Sequence (seq2seq)
• If our input is a sentence in Language A, and we wish to translate it to Language B, it is clearly sub-optimal to translate word by word (as our current models are suited to do).
[Figure: a word-by-word RNN translation of "Sentence to be translated" (producing "Frase", "a", "ser", "traducida"), showing input layer, hidden layer, and output layer.]
Sequence-to-Sequence (seq2seq)
• Instead, let a sequence of tokens be the unit that we ultimately wish to work with (a sequence of length N may emit a sequence of length M).
• Seq2seq models are composed of 2 RNNs: 1 encoder and 1 decoder (a minimal sketch of these two RNNs follows below).
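A minimal sketch of the two RNNs in PyTorch (the class names, the choice of GRU cells, and the dimensions are illustrative assumptions, not the course's exact implementation):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and returns all hidden states plus the final one."""
    def __init__(self, src_vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                   # src_tokens: (batch, src_len)
        embedded = self.embedding(src_tokens)        # (batch, src_len, embed_dim)
        outputs, hidden = self.rnn(embedded)         # hidden: (1, batch, hidden_dim)
        return outputs, hidden                       # all states + final state


class Decoder(nn.Module):
    """Generates the target sentence one token at a time."""
    def __init__(self, tgt_vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, tgt_token, hidden):             # tgt_token: (batch, 1)
        embedded = self.embedding(tgt_token)          # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, hidden)   # one step of the decoder RNN
        logits = self.out(output)                     # (batch, 1, tgt_vocab_size)
        return logits, hidden
```

The decoder is written to take one token at a time, which is the form used in the training and decoding sketches later in the lecture.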
Sequence-to-Sequence (seq2seq)
[Figure: the encoder RNN reads the input "Sentence to be translated" word by word, producing hidden states $h_1^e, h_2^e, h_3^e, h_4^e$.]
Sequence-to-Sequence (seq2seq)
The final hidden state of the encoder RNN is the initial state of the decoder RNN.
[Figure, built up over several slides: the encoder RNN reads "Sentence to be translated" into hidden states $h_1^e, \dots, h_4^e$. The decoder RNN starts from $h_4^e$, is fed a special start character <s>, and generates the translation one word at a time: "Oración", "por", "traducir", with decoder hidden states $h_1^d, \dots, h_4^d$; each generated word is fed in as the next decoder input.]
Sequence-to-Sequence (seq2seq)
The final output of the decoder RNN is the end-of-sequence token </s>.
[Figure: the full encoder-decoder, with the decoder emitting "Oración por traducir </s>".]
Sequence-to-Sequence (seq2seq)
Training occurs as it typically does for RNNs: the decoder generates outputs one word at a time, until we generate a </s> token.
[Figure: the decoder produces one probability vector $\hat{y}_1, \hat{y}_2, \hat{y}_3, \hat{y}_4$ per step.]
Sequence-to-Sequence (seq2seq)
Each $\hat{y}_t$ is a probability vector of the size of the vocabulary. The loss at each step is computed by multiplying the probability vector with the one-hot-encoded vector of the target word (a small numerical sketch follows below).
[Figure: per-step losses $L_1, L_2, L_3, L_4$ computed against the targets "Oración", "por", "traducir", "</s>".]
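To make the per-step loss concrete, here is a tiny sketch (the probability vector and vocabulary are made up for illustration). Multiplying the predicted probabilities by the one-hot target picks out the probability assigned to the true word; taking its negative log gives the standard cross-entropy loss:

```python
import torch
import torch.nn.functional as F

vocab_size = 5                                          # toy vocabulary, for illustration only
probs = torch.tensor([0.10, 0.60, 0.05, 0.20, 0.05])    # decoder probability vector ŷ_t
target = torch.tensor(1)                                # index of the true word at this step

one_hot = F.one_hot(target, vocab_size).float()
p_true = (probs * one_hot).sum()    # multiply with the one-hot vector -> prob. of the true word
loss_manual = -torch.log(p_true)    # per-step loss L_t

# Same value from the built-in loss (cross_entropy expects logits, so we pass log-probabilities):
loss_builtin = F.cross_entropy(torch.log(probs).unsqueeze(0), target.unsqueeze(0))
print(loss_manual.item(), loss_builtin.item())          # both ≈ 0.51
```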
Sequence-to-Sequence (seq2seq)
Training occurs as it typically does for RNNs; the loss (from the decoder outputs) is calculated, and we update the weights all the way back to the beginning (the encoder). A training-step sketch follows below.
[Figure: gradients from the losses $L_1, \dots, L_4$ flow through the decoder and back into the encoder.]
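A minimal sketch of one such training step with teacher forcing, reusing the hypothetical Encoder/Decoder classes sketched earlier (names and details are illustrative, not the course's exact implementation):

```python
def train_step(encoder, decoder, optimizer, src, tgt, loss_fn):
    """src: (batch, src_len) source token ids.
    tgt: (batch, tgt_len) target token ids, starting with <s> and ending with </s>."""
    optimizer.zero_grad()
    _, hidden = encoder(src)                   # final encoder state initializes the decoder
    loss = 0.0
    for t in range(tgt.size(1) - 1):
        inp = tgt[:, t:t + 1]                  # teacher forcing: feed the true previous word
        logits, hidden = decoder(inp, hidden)  # logits: (batch, 1, vocab_size)
        loss = loss + loss_fn(logits.squeeze(1), tgt[:, t + 1])
    loss.backward()                            # gradients flow through the decoder AND the encoder
    optimizer.step()
    return loss.item()

# loss_fn would typically be nn.CrossEntropyLoss(); the optimizer should cover
# both encoder.parameters() and decoder.parameters().
```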
Sequence-to-Sequence (seq2seq)
Prediction: during actual translation, the output of time t becomes the input for time t+1 (a greedy-decoding sketch follows below).
[Figure: at prediction time, each decoder output $\hat{y}_t$ is fed back in as the next decoder input.]
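A sketch of that greedy decoding loop, again using the hypothetical classes from earlier (sos_id and eos_id are assumed token ids for <s> and </s>):

```python
import torch

@torch.no_grad()
def translate(encoder, decoder, src, sos_id, eos_id, max_len=20):
    """Greedy decoding for a single sentence; src has shape (1, src_len)."""
    _, hidden = encoder(src)
    inp = torch.tensor([[sos_id]])             # start the decoder with the <s> token
    output_ids = []
    for _ in range(max_len):
        logits, hidden = decoder(inp, hidden)
        next_id = logits.argmax(dim=-1)        # most probable word at this step
        output_ids.append(next_id.item())
        if next_id.item() == eos_id:           # stop once </s> is produced
            break
        inp = next_id                          # output at time t becomes input at time t+1
    return output_ids
```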
Sequence-to-Sequence (seq2seq)
However, the seq2seq model outputs as many words as the longest input in the selected batch. During the final processing, all the words after </s> are thrown out (see the sketch below).
[Figure: the decoder keeps generating past </s> (e.g. "el"); everything after the first </s> is discarded.]
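A minimal sketch of that post-processing step (a hypothetical helper, shown on word strings rather than token ids for readability):

```python
def trim_after_eos(tokens, eos="</s>"):
    """Keep everything up to and including the first </s>; drop the rest."""
    if eos in tokens:
        return tokens[:tokens.index(eos) + 1]
    return tokens

# Example:
# trim_after_eos(["Oración", "por", "traducir", "</s>", "el"])
# -> ["Oración", "por", "traducir", "</s>"]
```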
Sequence-to-Sequence (seq2seq)
See any issues with this traditional seq2seq paradigm?
Sequence-to-Sequence (seq2seq)
It’s crazy that the entire “meaning” of the first sequence is expected to be packed into one embedding, and that the encoder then never interacts with the decoder again. Hands free!
What other alternatives can we have?
Sequence-to-Sequence (seq2seq)
One alternative would be to pass every state of the encoder to every state of the decoder.
[Figure: every encoder hidden state $h_1^e, \dots, h_4^e$ is passed to every decoder hidden state $h_1^d, \dots, h_4^d$.]
Sequence-to-Sequence (seq2seq)
But how much context is enough?
The number of states to send to the decoder can get extremely large.
[Figure: for a long input, the number of encoder states passed to each decoder state grows with the input length.]
Sequence-to-Sequence (seq2seq)
Another alternative could be to weight every state:
$$h_t^d = \tanh\left(V x_t + U h_{t-1}^d + W_1 h_1^e + W_2 h_2^e + W_3 h_3^e + \cdots\right)$$
[Figure: every encoder state $h_i^e$ enters the decoder update, each scaled by its own weight matrix $W_i$.]
Sequence-to-Sequence (seq2seq)
The real problem with this approach is that the weights $W_1, W_2, W_3, \dots$ are fixed:
$$h_t^d = \tanh\left(V x_t + U h_{t-1}^d + W_1 h_1^e + W_2 h_2^e + W_3 h_3^e + \cdots\right)$$
[Figure: the same weighted combination of encoder states, with weights that do not change from step to step.]
What we want instead is for the decoder, at each step, to decide how much attention to pay to each of the encoder's hidden states:
$$W_{ti} = g\left(h_i^e,\, x_t^d,\, h_{t-1}^d\right)$$
where g is a function parameterized by the states of the encoder, the current input to the decoder, and the state of the decoder. $W_{ti}$ indicates how much attention to pay to each hidden state of the encoder. The function g gives what we call the attention.
Outline
Sequence 2 Sequence
Sequence 2 Sequence with Attention
Seq2Seq + Attention
Q: How do we determine how much attention to pay to each of the encoder's hidden states, i.e., how do we determine g(·)?
[Figure: hypothetical attention weights (0.4?, 0.3?, 0.1?, 0.2?) placed over the encoder hidden states $h_1^e, \dots, h_4^e$.]
Seq2Seq + Attention
A: Let's base it on our decoder's previous hidden state (our latest representation of meaning) and all of the encoder's hidden states!
[Figure: the decoder hidden state $h_1^d$ (after reading <s>) is compared against each encoder hidden state.]
Seq2Seq + Attention
For this, we define a context vector $c_t$, computed as a weighted sum of the encoder states $h_i^e$:
$$c_t = \sum_{i=1}^{N_e} \alpha_{ti}\!\left(h_{t-1}^d, h_i^e\right) h_i^e$$
The weight $\alpha_{ti}$ for each state $h_i^e$ is computed as
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{N_e} \exp(e_{tk})}, \qquad e_{ti} = \mathrm{NN}_{W_e}\!\left(h_{t-1}^d, h_i^e\right),$$
where $\mathrm{NN}_{W_e}$ is a fully connected neural network with weights $W_e$. A sketch of this computation follows below.
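A minimal sketch of these formulas in PyTorch (the layer sizes and the particular one-hidden-layer form of $\mathrm{NN}_{W_e}$ are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Scores each encoder state against the previous decoder state and returns
    the softmax weights alpha_ti and the context vector c_t."""
    def __init__(self, hidden_dim):
        super().__init__()
        # NN_{W_e}: a small fully connected network applied to [h_{t-1}^d ; h_i^e]
        self.score_net = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, dec_hidden, enc_states):
        # dec_hidden: (batch, hidden_dim); enc_states: (batch, src_len, hidden_dim)
        src_len = enc_states.size(1)
        dec_rep = dec_hidden.unsqueeze(1).expand(-1, src_len, -1)     # repeat h_{t-1}^d for each i
        e = self.score_net(torch.cat([dec_rep, enc_states], dim=-1)).squeeze(-1)  # e_ti: (batch, src_len)
        alpha = F.softmax(e, dim=-1)                                  # alpha_ti
        context = torch.bmm(alpha.unsqueeze(1), enc_states)           # weighted sum: (batch, 1, hidden_dim)
        return alpha, context.squeeze(1)
```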
Seq2Seq + Attention
Q: How do we determine how much attention to pay to each of the encoder's hidden states?
A: Let's base it on our decoder's previous hidden state (our latest representation of meaning) and all of the encoder's hidden states! We want to measure the similarity between the decoder hidden state and the encoder hidden states in some way.
[Figure, built up over several slides: an attention layer, a separate FFNN $\mathrm{NN}_{W_e}$ (it could also be a cosine similarity), scores each pair $(h_i^e, h_1^d)$, giving raw attention scores $e_1 = 1.5$, $e_2 = 0.9$, $e_3 = 0.3$, $e_4 = -0.5$.]
Seq2Seq + Attention
The raw scores are then passed through a softmax to obtain the attention weights:
$$\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})}$$
[Figure: the raw scores $e_1 = 1.5$, $e_2 = 0.9$, $e_3 = 0.3$, $e_4 = -0.5$ become the softmax'd weights $\alpha_{11} = 0.51$, $\alpha_{12} = 0.28$, $\alpha_{13} = 0.14$, $\alpha_{14} = 0.07$.]
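A quick numerical check of the softmax on the raw scores from the figure:

```python
import torch
import torch.nn.functional as F

e = torch.tensor([1.5, 0.9, 0.3, -0.5])   # raw attention scores from the figure
alpha = F.softmax(e, dim=0)
print(alpha)        # ≈ [0.50, 0.28, 0.15, 0.07], close to the rounded values on the slide
print(alpha.sum())  # the weights sum to 1
```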
Seq2Seq + Attention
We multiply each encoder hidden state $h_i^e$ by its attention weight $\alpha_{1i}$ and then add the resulting vectors to create a context vector $c_1^d$.
[Figure: $c_1^d = \alpha_{11} h_1^e + \alpha_{12} h_2^e + \alpha_{13} h_3^e + \alpha_{14} h_4^e$, which is used to produce the first decoder output $\hat{y}_1$.]
Seq2Seq + Attention
[Figure, built up over several slides: decoding with attention. The decoder is initialized with $h_0^d = h_4^e$. At step 1 it receives the concatenated input $[c_1^d, \text{<s>}]$ and outputs $\hat{y}_1$ = "Oración"; at step 2 it receives $[c_2^d, \hat{y}_1]$ and outputs $\hat{y}_2$ = "por"; at step 3, $[c_3^d, \hat{y}_2]$ gives $\hat{y}_3$ = "traducir"; at step 4, $[c_4^d, \hat{y}_3]$ gives $\hat{y}_4$ = "</s>". A new context vector $c_t^d$ (with weights $\alpha_{ti}$) is computed at every decoder step.]
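Putting the pieces together, here is a minimal sketch of one attention-decoder step (hypothetical names, building on the Encoder and Attention sketches above): the context vector is concatenated with the embedding of the previous output word and fed to the decoder RNN.

```python
import torch
import torch.nn as nn

class AttnDecoder(nn.Module):
    """One decoder step that conditions on a fresh context vector c_t at every step."""
    def __init__(self, tgt_vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, embed_dim)
        self.attention = Attention(hidden_dim)      # the Attention module sketched earlier
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, prev_token, hidden, enc_states):
        # prev_token: (batch, 1); hidden: (1, batch, hidden); enc_states: (batch, src_len, hidden)
        alpha, context = self.attention(hidden[-1], enc_states)        # weights and context c_t
        embedded = self.embedding(prev_token)                          # (batch, 1, embed_dim)
        rnn_in = torch.cat([context.unsqueeze(1), embedded], dim=-1)   # [c_t, y_{t-1}]
        output, hidden = self.rnn(rnn_in, hidden)
        return self.out(output), hidden, alpha     # keeping alpha lets us visualize attention later
```

Returning the attention weights at every step is what makes the visualization on the next slide possible.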
Seq2Seq + Attention
Attention:
• greatly improves seq2seq results
• allows us to visualize the contribution each word gave during each step of the decoder
Image source: Fig. 3 in Bahdanau et al., 2015
Exercise: Vanilla Seq2Seq from Scratch
The aim of this exercise is to train a Seq2Seq model
from scratch. For this, we will be using an English to
Spanish translation dataset.
What’s next
Lecture 7: Self attention models, Transformer models
Lecture 8: BERT
THANK YOU