[Board notes: positional encodings, e.g. sin(pos / 10000^(2i/d)), are added to the token embeddings; example sentence "I am going to class" with positions 1-5]
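A minimal NumPy sketch of the sinusoidal positional encoding suggested by the board notes; the function name and shapes are illustrative, not from the slides:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2), the 2i indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings, e.g. for the 5-token sentence "I am going to class":
# x = token_embeddings + sinusoidal_positional_encoding(5, 512)
```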
Multi-headed Self Attention Block
d_model = 512, h = 8 heads, d_k = d_v = d_model / h = 64 per head
Figure: Jay Alammar
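As a rough illustration of the block above, here is a minimal NumPy forward pass for multi-head self-attention with the slide's dimensions (d_model = 512, 8 heads of size 64); the random weight matrices are placeholders, not a trained model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, h=8):
    """x: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_k = d_model // h                                  # 512 / 8 = 64 per head
    split = lambda t: t.reshape(seq_len, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    Q, K, V = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, seq, seq)
    heads = softmax(scores) @ V                         # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo                                  # (seq, d_model)

rng = np.random.default_rng(0)
d_model, seq_len = 512, 5
Wq, Wk, Wv, Wo = (rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(4))
x = rng.normal(size=(seq_len, d_model))
print(multi_head_self_attention(x, Wq, Wk, Wv, Wo).shape)   # (5, 512)
```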
Encoder Block
Figure: Jay Alammar
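A minimal sketch of one encoder block (self-attention and feed-forward sublayers, each wrapped in a residual connection and layer norm), assuming the same default dimensions; it uses PyTorch's nn.MultiheadAttention as a stand-in for the attention sublayer:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: self-attention and FFN sublayers, each with residual + LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)     # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)         # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))      # residual connection + layer norm
        return x

print(EncoderBlock()(torch.randn(1, 5, 512)).shape)   # torch.Size([1, 5, 512])
```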
Transformer with encoders and decoders
Figure: Jay Alammar
How is the decoder different?
Figure: Jay Alammar
Masked Self-attention
https://rycolab.io/classes/llm-s24/
Masked Self-attention for decoder
(to avoid seeing the future tokens)
[Figure: masked self-attention weight matrix over positions 1-5; each position attends only to itself and earlier positions]
https://web.stanford.edu/~jurafsky/slp3/
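A minimal NumPy sketch of how the causal mask is applied: positions above the diagonal of the score matrix are set to -inf before the softmax, so no position can attend to future tokens (names and the 5x5 example are illustrative):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """0 where attention is allowed (j <= i), -inf strictly above the diagonal."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention_weights(scores: np.ndarray) -> np.ndarray:
    """scores: raw (seq_len, seq_len) QK^T / sqrt(d_k); returns row-wise softmax with mask."""
    scores = scores + causal_mask(scores.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(5, 5))
print(np.round(masked_attention_weights(scores), 2))   # upper triangle is all zeros
```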
No masking at inference time
At inference, the decoder generates one token at a time, so only the tokens produced so far are available to attend to; no mask is needed.
[Board example: decoding "I am going to class" step by step from <start>; the last decoder output at each step predicts the next token]
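A tiny sketch of why no mask is needed when decoding at inference time: tokens are generated one at a time, so the model is only ever fed the prefix produced so far. The `model` argument is a hypothetical callable that returns next-token logits; the toy stand-in at the bottom is made up for illustration:

```python
import numpy as np

def greedy_decode(model, start_id: int, eos_id: int, max_len: int = 20):
    """Generate token by token; the model only ever sees the prefix produced so far."""
    tokens = [start_id]                      # e.g. <start>, then "I am going to class" ...
    for _ in range(max_len):
        logits = model(tokens)               # next-token logits given the prefix
        next_id = int(np.argmax(logits))     # greedy: pick the most likely token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Toy stand-in "model" over a 10-token vocabulary: always predicts (last token + 1) % 10.
toy_model = lambda tokens: np.eye(10)[(tokens[-1] + 1) % 10]
print(greedy_decode(toy_model, start_id=0, eos_id=9))   # [0, 1, 2, ..., 9]
```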
Encoder-Decoder Attention
The output of the top encoder is transformed into a set of attention vectors K and V.
These are to be used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on appropriate places in the input sequence.
Figure: Jay Alammar
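A minimal NumPy sketch of encoder-decoder (cross) attention: queries come from the decoder states while keys and values come from the top encoder's output; the projection matrices and sizes are illustrative placeholders:

```python
import numpy as np

def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
    """Queries from the decoder states; keys and values from the encoder output."""
    d_k = Wq.shape[1]
    Q = dec_x @ Wq                           # (tgt_len, d_k)  decoder side
    K = enc_out @ Wk                         # (src_len, d_k)  encoder side
    V = enc_out @ Wv                         # (src_len, d_k)  encoder side
    scores = Q @ K.T / np.sqrt(d_k)          # (tgt_len, src_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # each target position mixes source information

rng = np.random.default_rng(0)
d, d_k = 512, 64
dec_x, enc_out = rng.normal(size=(3, d)), rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(scale=0.02, size=(d, d_k)) for _ in range(3))
print(cross_attention(dec_x, enc_out, Wq, Wk, Wv).shape)   # (3, 64)
```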
Decoding
The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did.
We embed and add positional encoding to those decoder inputs to indicate the position of each word.
Figure: Jay Alammar
Converting decoder stack output to words
That’s the job of the final Linear layer, which is followed by a Softmax layer.
Figure: Jay Alammar
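A minimal sketch of that final Linear + Softmax step: the top decoder state is projected to vocabulary-size logits and normalized into a probability distribution. When weight tying is used (as in the parameter-count slides below), W_unembed is the transpose of the embedding matrix; names here are illustrative:

```python
import numpy as np

def lm_head(hidden, W_unembed):
    """hidden: (d_model,); W_unembed: (d_model, vocab_size), the transpose of the
    embedding matrix when weight tying is used."""
    logits = hidden @ W_unembed                   # one score per vocabulary word
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs                                  # softmax: distribution over the vocabulary

# next_word_id = int(np.argmax(lm_head(top_decoder_state, W_unembed)))
```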
Transformers as Language models (Decoder-only)
● Masked self-attention
● No encoder-decoder attention
https://web.stanford.edu/~jurafsky/slp3/
Transformers as Language models
https://web.stanford.edu/~jurafsky/slp3/
Language modeling head
https://web.stanford.edu/~jurafsky/slp3/
Transformers LM
https://web.stanford.edu/~jurafsky/slp3/
Number of parameters in encoder-decoder?
Assume the default config: 6 encoder layers and 6 decoder layers, d_model = 512, h = 8 heads (d_k = d_v = 64), d_ff = 2048, vocabulary V ≈ 40K, with weight tying between the embedding and the unembedding.

Encoder, per layer:
self-attention: W_Q, W_K, W_V, W_O → 4 × d × d = 4 × 512 × 512 ≈ 1M
feed-forward: up- and down-projection → 2 × d × d_ff = 2 × 512 × 2048 ≈ 2M
≈ 3M per layer, × 6 layers ≈ 19M for the encoder stack

Embedding / unembedding (tied): V × d ≈ 40K × 512 ≈ 20M
Number of parameters in encoder-decoder?
Decoder, per layer (assumes the same default config):
self-attention ≈ 1M
encoder-decoder attention ≈ 1M
feed-forward ≈ 2M
≈ 4M per layer, × 6 layers ≈ 24M for the decoder stack

Total ≈ 19M (encoders) + 24M (decoders) + 20M (tied embedding/unembedding) ≈ 63M
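A small sketch that reproduces the count above for the default config; bias terms and layer-norm parameters are ignored, as in the slide arithmetic:

```python
def encoder_decoder_params(L=6, d=512, d_ff=2048, V=40_000):
    attn = 4 * d * d                   # W_Q, W_K, W_V, W_O         (~1M)
    ffn = 2 * d * d_ff                 # up- and down-projection    (~2M)
    encoder = L * (attn + ffn)         # ~3M per layer * 6          (~19M)
    decoder = L * (2 * attn + ffn)     # self-att + cross-att + FFN (~24M total)
    embed = V * d                      # tied embedding/unembedding (~20M)
    return encoder + decoder + embed

print(f"{encoder_decoder_params() / 1e6:.1f}M")   # ~64.5M exact; the rounded slide figures give ~63M
```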
Number of parameters in the decoder only Transformer?
Same per-layer cost as the encoder layers (self-attention + feed-forward, no encoder-decoder attention), ≈ 19M for 6 layers, + embedding/unembedding.
GPT-3 (175B): d_model = 12288.
Try this one
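The same counting approach applied to the exercise; GPT-3's published configuration (96 layers, d_ff = 4 × d_model, roughly 50K BPE vocabulary) is an assumption beyond what the slide states, which only gives d_model = 12288 and the 175B total:

```python
def decoder_only_params(L, d, V=50_000, d_ff_mult=4):
    attn = 4 * d * d                   # W_Q, W_K, W_V, W_O (no encoder-decoder attention)
    ffn = 2 * d * (d_ff_mult * d)      # GPT-style models use d_ff = 4 * d_model
    return L * (attn + ffn) + V * d    # L layers + tied embedding/unembedding

print(f"{decoder_only_params(L=96, d=12288) / 1e9:.0f}B")   # ~175B
```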
Decoding: Greedy and Beam Search are deterministic!
● Greedy decoding as well as Beam Search decoding will give a “deterministic” output
● Other common decoding algorithms involve “sampling”, and bring in some degree
of “randomness”
Quality vs Diversity trade-off
Random Sampling with Temperature
Given the logits u = [u_1, …, u_|V|] over the vocabulary, sample
y ∼ softmax(u / τ), i.e. p(w_i) = exp(u_i / τ) / Σ_j exp(u_j / τ),
with temperature τ ∈ (0, 1], e.g. τ = 0.2, …, 0.99.
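A minimal sketch of sampling with temperature: the logits u are divided by τ before the softmax, so small τ approaches greedy decoding while τ = 1 recovers plain random sampling; the logits below are made-up example values:

```python
import numpy as np

def sample_with_temperature(logits, tau, rng=None):
    """y ~ softmax(u / tau); smaller tau makes the distribution peakier."""
    rng = rng or np.random.default_rng()
    scaled = logits / tau
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])          # made-up next-token logits
for tau in (0.2, 0.7, 1.0):
    print(tau, sample_with_temperature(logits, tau))
```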
Why does this work?
Without scaling, random sampling often generates incoherent gibberish.
https://utah-cs6340-nlp.notion.site/Natural-Language-Processing-bd1a2ca290fc44f69556908ad8d25c70