Seq2Seq (with Attention)
Pavlos Protopapas

Outline
• Sequence 2 Sequence
• Sequence 2 Sequence with Attention

Sequence-to-Sequence (seq2seq)
• If our input is a sentence in Language A and we wish to translate it into Language B, it is clearly sub-optimal to translate word by word (as our current models are suited to do).
[Figure: word-by-word translation of "Sentence to be translated" into "Frase ser a traducida"]

Sequence-to-Sequence (seq2seq)
• Instead, let a sequence of tokens be the unit that we ultimately wish to work with (an input sequence of length N may emit an output sequence of length M).
• Seq2seq models are composed of 2 RNNs: 1 encoder and 1 decoder.

Sequence-to-Sequence (seq2seq)
• The encoder RNN reads the input "Sentence to be translated" and produces hidden states $h_1^e, h_2^e, h_3^e, h_4^e$.
• The final hidden state of the encoder RNN is the initial state of the decoder RNN.
• Starting from the special character <s>, the decoder emits one word at a time ("Oración", "por", "traducir") and finally the end token </s>; each emitted word is fed back as the decoder's next input.
[Figure: encoder RNN over "Sentence to be translated" feeding its final state into a decoder RNN that generates "Oración por traducir </s>"]

Sequence-to-Sequence (seq2seq)
• Training occurs the way it typically does for RNNs. The decoder generates outputs one word at a time, until we generate a </s> token.
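To make the structure concrete, here is a minimal sketch of such an encoder-decoder pair, assuming PyTorch and GRU cells; the cell type, layer sizes, and the Encoder/Decoder names are illustrative choices, not the lecture's reference implementation.

```python
# Minimal seq2seq sketch (assumed PyTorch GRUs; sizes and names are illustrative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(src_vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_tokens):                    # (batch, src_len)
        outputs, h_final = self.rnn(self.embed(src_tokens))
        return outputs, h_final                       # all states, final state (1, batch, hid_dim)

class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab_size)

    def forward(self, tgt_tokens, h_init):
        # h_init is the encoder's final hidden state: it seeds the decoder.
        outputs, h_final = self.rnn(self.embed(tgt_tokens), h_init)
        return self.out(outputs), h_final             # logits over the target vocabulary

# Toy usage: during training the target sequence (starting with <s>) is fed in directly.
enc, dec = Encoder(src_vocab_size=1000), Decoder(tgt_vocab_size=1200)
src = torch.randint(0, 1000, (2, 3))                  # batch of 2 source sentences, length 3
tgt_in = torch.randint(0, 1200, (2, 4))               # <s> plus 3 target words
_, h = enc(src)
logits, _ = dec(tgt_in, h)                            # (2, 4, 1200)
```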
Sequence-to-Sequence (seq2seq)
• Each decoder output $\hat{y}_t$ is a probability vector whose size is that of the vocabulary.
• The per-step losses $L_1, L_2, L_3, L_4$ are computed by multiplying each probability vector with the one-hot-encoded vector of the corresponding target word ("Oración", "por", "traducir", "</s>").

Sequence-to-Sequence (seq2seq)
• Training occurs the way it typically does for RNNs: the loss (from the decoder outputs) is calculated, and we update the weights all the way back to the beginning (the encoder).

Sequence-to-Sequence (seq2seq)
• Prediction: during actual translation, the output at time t becomes the input at time t+1.

Sequence-to-Sequence (seq2seq)
• However, the seq2seq model outputs as many words as the longest input in the selected batch.
• During the final processing, all the words after </s> are thrown out.
[Figure: decoder continuing past </s> (e.g. emitting "el"); everything after </s> is discarded]
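A sketch of that prediction loop, reusing the hypothetical Encoder/Decoder classes from the earlier sketch (sos_id, eos_id, and max_len are illustrative): the word predicted at time t is fed back in at time t+1, and generation stops at </s> or at a maximum length, after which any words past </s> would be discarded.

```python
# Greedy decoding sketch: feed the previous prediction back in until </s>.
import torch

def greedy_translate(enc, dec, src, sos_id=1, eos_id=2, max_len=20):
    _, h = enc(src)                                        # encoder final state seeds the decoder
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)  # start with <s>
    generated = []
    for _ in range(max_len):                               # cap on the output length
        logits, h = dec(token, h)                          # one decoder step
        token = logits[:, -1].argmax(dim=-1, keepdim=True) # most likely word becomes the next input
        generated.append(token)
        if (token == eos_id).all():                        # every sequence in the batch emitted </s>
            break
    return torch.cat(generated, dim=1)                     # words after </s> are trimmed downstream
```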
Sequence-to-Sequence (seq2seq)
• See any issues with this traditional seq2seq paradigm?
• It is asking a lot for the entire "meaning" of the first sequence to be packed into one embedding, and for the encoder to then never interact with the decoder again. What other alternatives do we have?

Sequence-to-Sequence (seq2seq)
• One alternative would be to pass every state of the encoder to every state of the decoder.
• But how much context is enough? The number of states to send to the decoder can get extremely large.

Sequence-to-Sequence (seq2seq)
• Another alternative could be to weight every encoder state:
$$h_t^d = \tanh(V x_t + U h_{t-1}^d + W_1 h_1^e + W_2 h_2^e + W_3 h_3^e + \cdots)$$
• The real problem with this approach is that the weights $W_1, W_2, W_3, \dots$ are fixed.
• What we want instead is for the decoder, at each step, to decide how much attention to pay to each of the encoder's hidden states:
$$W_{ts} = g(h_s^e, x_t, h_{t-1}^d)$$
where $g$ is a function parameterized by all the states of the encoder, the current input to the decoder, and the previous state of the decoder. $W_{ts}$ indicates how much attention to pay to each hidden state of the encoder. The function $g$ gives what we call the attention.

Outline
• Sequence 2 Sequence
• Sequence 2 Sequence with Attention

Seq2Seq + Attention
Q: How do we determine how much attention to pay to each of the encoder's hidden states, i.e. how do we determine $g(\cdot)$?
[Figure: tentative attention weights (0.4? 0.3? 0.1? 0.2?) over the encoder states $h_1^e, \dots, h_4^e$]
A: Let's base it on our decoder's previous hidden state (our latest representation of meaning) and all of the encoder's hidden states!

Seq2Seq + Attention
• For this, we define a context vector $c_t$, computed as a weighted sum of the encoder states $h_s^e$:
$$c_t = \sum_{s=1}^{S} \alpha_{ts}(h_{t-1}^d, h_s^e)\, h_s^e$$
• The weight $\alpha_{ts}$ for each state $h_s^e$ is computed as
$$\alpha_{ts} = \frac{\exp(e_{ts})}{\sum_{k=1}^{S} \exp(e_{tk})}, \qquad e_{ts} = \mathrm{NN}_{W_a}(h_{t-1}^d, h_s^e),$$
where $\mathrm{NN}_{W_a}$ is a fully connected neural network (FCNN) with weights $W_a$.
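A sketch of these three equations, assuming the scorer $\mathrm{NN}_{W_a}$ is a small two-layer feed-forward net (the layer sizes and names are illustrative): it scores each encoder state against the previous decoder state, softmaxes the scores into weights $\alpha_{ts}$, and forms the context vector as the weighted sum.

```python
# Attention sketch: e_ts = NN(h^d_{t-1}, h^e_s), alpha = softmax(e), c_t = sum_s alpha_ts * h^e_s.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, hid_dim=128):
        super().__init__()
        # NN_{W_a}: a small fully connected net scoring (decoder state, encoder state) pairs.
        self.scorer = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.Tanh(), nn.Linear(hid_dim, 1)
        )

    def forward(self, dec_prev, enc_states):
        # dec_prev: (batch, hid_dim); enc_states: (batch, src_len, hid_dim)
        src_len = enc_states.size(1)
        dec_rep = dec_prev.unsqueeze(1).expand(-1, src_len, -1)       # repeat h^d_{t-1} per source position
        e = self.scorer(torch.cat([dec_rep, enc_states], dim=-1)).squeeze(-1)  # raw scores e_ts
        alpha = torch.softmax(e, dim=-1)                              # attention weights sum to 1
        context = (alpha.unsqueeze(-1) * enc_states).sum(dim=1)       # context vector c_t
        return context, alpha
```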
Seq2Seq + Attention
• We want to measure the similarity between the decoder hidden state and each encoder hidden state in some way. This could be cosine similarity; here a separate FFNN (the attention layer) scores each encoder state $h_s^e$ against the decoder state $h_1^d$.
• Attention (raw scores): $e_1 = 1.5$, $e_2 = 0.9$, $e_3 = 0.3$, $e_4 = -0.5$.
• Attention (softmax'd): $\alpha_{1s} = \exp(e_s) / \sum_k \exp(e_k)$, giving $\alpha_{11} = 0.51$, $\alpha_{12} = 0.28$, $\alpha_{13} = 0.14$, $\alpha_{14} = 0.07$.
[Figure: the attention layer scoring each encoder state $h_1^e, \dots, h_4^e$ against the decoder state, then softmaxing the scores]

Seq2Seq + Attention
• We multiply each encoder hidden state by its attention weight $\alpha_{1s}$ and then add the resulting vectors to create a context vector $c_1^d$.
• The decoder is initialized with the encoder's final hidden state; its first input is $[c_1^d, \text{<s>}]$, and it emits $\hat{y}_1$ ("Oración").
• Each subsequent step recomputes a context vector from the decoder's latest hidden state and feeds it in together with the previous output: $[c_2^d, \hat{y}_1]$ yields "por", $[c_3^d, \hat{y}_2]$ yields "traducir", and $[c_4^d, \hat{y}_3]$ yields "</s>".
[Figure: attention-weighted encoder states forming the context vector fed into each decoder step]
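A quick numeric check of the softmax step using the raw scores from this walkthrough; the weights it prints are close to the 0.51 / 0.28 / 0.14 / 0.07 shown above (small differences are just rounding).

```python
# Verify the softmax'd attention weights for the example raw scores e_1..e_4.
import numpy as np

e = np.array([1.5, 0.9, 0.3, -0.5])       # raw attention scores from the walkthrough
alpha = np.exp(e) / np.exp(e).sum()       # softmax
print(alpha.round(2))                     # ~ [0.5, 0.28, 0.15, 0.07]
```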
Seq2Seq + Attention
Attention:
• greatly improves seq2seq results
• allows us to visualize the contribution each word gave during each step of the decoder
(Image source: Fig. 3 in Bahdanau et al., 2015)

Exercise: Vanilla Seq2Seq from Scratch
The aim of this exercise is to train a Seq2Seq model from scratch. For this, we will be using an English-to-Spanish translation dataset.

What's next
• Lecture 7: Self-attention models, Transformer models
• Lecture 8: BERT

THANK YOU