The Frontier of Transformers
— Beyond Convolutional Neural Networks —
OMRON SINIC X
Yoshitaka Ushiku
losnuevetoros
Self-introduction
2013.6–2013.8   Microsoft Research Intern
2014.3          PhD (Information Science and Technology), The University of Tokyo
2014.4–2016.3   Researcher, NTT Communication Science Laboratories
2016.4–2018.9   Lecturer, The University of Tokyo (Harada-Ushiku Lab)
2016.9–         Cooperative Researcher, AIST
2016.12–2018.9  Joint Researcher, National Institute for Japanese Language and Linguistics
2018.10–        Principal Investigator, OMRON SINIC X Corporation
2019.1–         Director & Chief Research Officer, Ridge-i Inc.
2020.4–         Part-time Lecturer, Tsuda University
2021.7–         Part-time Lecturer, Tohoku University
Research: image caption generation [Ushiku+, ACMMM 2012]; cross-retrieval between
video segments and captions [Ushiku+, ICCV 2015][Yamaguchi+, ICCV 2017]
(Example captions: "A yellow train on the tracks near a train station." /
"A guy is skiing with no shirt on and yellow snow pants.")
Member of: ACM, IEEE, IEICE, IPSJ, Robotics Society of Japan, JSAI,
Japan Society of Applied Physics, Architectural Informatics Society,
Japan Deep Learning Association
About this talk
A bird's-eye view from the Transformer's basic operation to its range of
applications and the recent MLP-style networks:
• What is a Transformer anyway?
• The Transformer whirlwind
• Is the Transformer obsolete?!
• Transformer know-how
※ The latest version of these slides is available online (also linked in the subtitle).
For anyone who has heard so much about Transformers that nothing makes sense anymore
• A Transformer just repeats two modules over tokens (words, patches, etc.) joined by residual connections:
  – a Token Mixer that mixes the tokens
  – an MLP that transforms each token
• The much-discussed MLP-style models actually have basically the same structure!
[Figure: the dotted block, repeated several times — normalize → transform the
vectors jointly; normalize → transform the vectors individually]
• ...beyond that, there is plenty of know-how, so stay alert
What is a Transformer anyway?
The figure you always see in Transformer papers
It is a concise figure, and it looks easy to understand —
but it is actually hard to understand.
※ Personal opinion
What a Transformer is
A network that transforms an arbitrary number of vectors into an arbitrary number of vectors.
• Input: an arbitrary number of vectors
• Output: an arbitrary number of vectors
(Everything else in the figure: ignore for now.)
Positional Encoding??
Positional Encoder
Vectors that represent the positions of the input and output vectors:
• For word embeddings: "which word in the sentence?"
• For image features: "which coordinates in the image?"
• For time-series features: "which time step?"
In [Vaswani+, NIPS'17], word positions are encoded as vectors of sines and cosines
→ the wave symbol in the figure stands for sine/cosine (see the sketch below).
RNNs and CNNs already know where each feature sits:
• "Both RNNs and CNNs know the relative position of each element" [Shaw+, NAACL-HLT'18][Dai+, ACL'19]
• "CNNs learn the absolute position of each element" [Islam+, ICLR'20]
  – In image saliency estimation, feeding the whole image vs. a crop:
    in both cases "the center of the image" gets high saliency.
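Since the slides only show the wave symbol for this, here is a minimal NumPy sketch (naming mine) of the sine/cosine encoding defined in [Vaswani+, NIPS'17]; it assumes an even d_model:

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(num_positions)[:, None]            # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)   # broadcast together
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# pe[pos] is added to the embedding of the token at position pos.
pe = sinusoidal_positional_encoding(num_positions=50, d_model=512)
```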
Autoregressive vs. Non-autoregressive
How are the output vectors produced?
• Autoregressive = estimate them one at a time
  – Using the input vectors and the already-generated output vectors, emit one new vector
  – Repeat until a word meaning "end" appears
  – The original Transformer [Vaswani+, NIPS'17] is autoregressive
  – Accurate, but slow
  [Figure: the n already-generated output vectors feed back in to produce the next one]
• Non-autoregressive = estimate all of them at once
  – Feed "seed" vectors for the multiple outputs, produced by some scheme
    [Gu+, ICLR'18][Guo+, AAAI'19]
  – One parallel pass, so it is fast, but accuracy drops
  – Repeating the one-shot estimation adjusts the number of words
    [Ghazvininejad+, EMNLP'19][Gu+, NeurIPS'19]
  [Figure: N "seed" vectors go in, N output vectors come out]
Surely everything up to here is OK.
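To make the contrast concrete, here is a sketch of the autoregressive loop under stated assumptions: step_fn, bos_id, and eos_id are hypothetical stand-ins for a real decoder and its vocabulary.

```python
import numpy as np

def decode_autoregressive(step_fn, bos_id: int, eos_id: int, max_len: int = 50):
    """Greedy autoregressive decoding: one token per full decoder pass.

    step_fn(prefix) stands in for the decoder: given the token ids so far,
    it returns one score per vocabulary entry (a hypothetical stub here).
    """
    tokens = [bos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)              # a whole pass per new token -> slow
        tokens.append(int(np.argmax(scores)))
        if tokens[-1] == eos_id:              # stop once the "end" token appears
            break
    return tokens[1:]

# A non-autoregressive model would instead emit scores for all positions in one
# parallel pass from N "seed" vectors: fast, but typically less accurate.
```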
Next: inside the part we had been ignoring
This block is:
• Residual connections
• An MLP applied to each vector (equivalent to a 1x1 conv)
• Multi-Head Attention — the well-known diagram explains this module,
  and the diagram inside it explains the Scaled Dot-Product Attention.
Multi-head Attention
Recap: what we handle is an arbitrary number of vectors
• In NLP: the sequence of word embeddings + positional information
  [Figure: one vector per token of "Ougai , you look like the cat that ate the canary ."]
• In image recognition: a height × width grid of local-feature vectors,
  one dimension per channel, + vertical/horizontal positional information
  [Figure: a photo of Ougai-san the cat over a grid of vectors]
Consider one particular vector x.
1. Compute a query q = W^Q x for "searching" the other vectors.
[Figure: q is computed from the highlighted vector x]
2. For every vector, compute a key k = W^K x, the target of the "similarity search".
[Figure: a key k is computed from each vector]
3. Compute the "similarity" a = softmax(qᵀk / √d): the dot product of the query
with each key, scaled by the dimensionality d and normalized with softmax.
[Figure: a similarity a is computed between q and every key k]
4. Compute each vector's representative value v = W^V x.
[Figure: a value v is computed from each vector]
5. Compute the similarity-weighted sum Σ a·v → a vector;
add it to x (residual connection) to obtain the updated vector.
[Figure: the weighted values a·v are accumulated into the updated vector]
This corresponds to the "Scaled Dot-Product Attention" diagram.
Steps 1–5 are executed for every vector.
In Multi-head Attention — this corresponds to the "Multi-Head Attention" diagram —
h sets of matrices W^Q, W^K, W^V are prepared, and steps 1–5 are executed
for all vectors with each of them.
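Steps 1–5 fit in a few lines of NumPy. A minimal sketch follows; note that the original Transformer additionally applies an output projection W^O after concatenating the heads, omitted here:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Steps 1-5 for all n input vectors at once; X has one vector per row."""
    Q = X @ Wq                                # step 1: queries
    K = X @ Wk                                # step 2: keys
    V = X @ Wv                                # step 4: values
    d = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))         # step 3: similarities a
    return A @ V                              # step 5: weighted sums sum(a*v)

def multi_head_attention(X, heads):
    """Prepare h triples (Wq, Wk, Wv) and run steps 1-5 with each of them."""
    return np.concatenate(
        [scaled_dot_product_attention(X, *w) for w in heads], axis=-1)

rng = np.random.default_rng(0)
n, d_model, h = 12, 16, 4
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_model // h)) for _ in range(3))
         for _ in range(h)]
Y = multi_head_attention(X, heads)            # (12, 16): add to X for the residual
```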
Transformer: summary
A technique that transforms an arbitrary number of vectors in an encoder-decoder
arrangement, repeating the block N times.
For example, GPT-3:
• repeats the block 96 times
• with 96-head attention (128 dimensions per head) in each
→ 175B parameters
※ PaLM, released by Google in April 2022, has 540B [Chowdhery+, 2022]
How it differs from (attention-free) CNNs and RNNs
The Transformer...
computes attention from all n vectors to all n vectors:
O(n²), but information travels any distance in a single hop, without loss.
A CNN...
convolves only a few vectors at a time (say 3 — few relative to the whole), sliding along the sequence:
O(n), but information propagates only between neighbors.
An RNN...
scans the vectors one by one, storing state (memory) in its internal cells:
O(n), but it struggles with long sequences.
[Figure: connectivity patterns of CNN, RNN, and Transformer]
The Transformer whirlwind
Beyond Computer Vision
• Natural language processing
  – Translation, i.e., the original paper [Vaswani+, NIPS'17], and countless others!!!
  – Language models: Google's BERT [Devlin+, NAACL-HLT'19] and PaLM [Chowdhery+, 2022],
    DeepMind's Gopher [Rae+, 2022], OpenAI's GPT-2/3 [Radford+, 2019][Brown+, NeurIPS'20],
    among many more
  – Retrieval over a database of 2 trillion tokens [Borgeaud+, 2021]
• Speech and signal processing
  – Representation learning: HuBERT [Hsu+, TASLP'21], SSAST [Gong+, AAAI'22]
  – Speech recognition [Lüscher+, INTERSPEECH'19]
  – Music generation [Huang+, ICLR'19][Choi+, ICML'20]
  – Time-series forecasting [Li+, NeurIPS'19][Wu+, NeurIPS'20]
• Tabular data
  – FT-Transformer [Gorishniy+, NeurIPS'22] ※ though for tabular data Gradient Boosting is still strong
• Bio/Chem-informatics
  – Molecular structure analysis [Fuchs+, NeurIPS'20][Rong+, NeurIPS'20]
• Agents and robotics
  – Multi-agent communication [Inala+, NeurIPS'20]
  – One-shot imitation learning [Dasari+Gupta, CoRL'20]
  – RL over task sequences: Scene Memory Transformer [Fang+, CVPR'19], Decision Transformer
    [Chen+, NeurIPS'21], Trajectory Transformer [Janner+, NeurIPS'21], Gato [Reed+, 2022]
Vision & Language
• Representation learning → foundation models:
  VideoBERT [Sun+, ICCV'19], LXMERT [Tan+Bansal, EMNLP'19], ViLBERT [Lu+, NeurIPS'19],
  VL-BERT [Su+, ICLR'20], UNITER [Chen+, ECCV'20], OSCAR [Li+, ECCV'20],
  Voken [Tan+Bansal, EMNLP'20], COOT [Ging+, NeurIPS'20], Perceiver [Jaegle+, ICML'21],
  PolyViT [Likhosherstov+, 2021], Flamingo [Alayrac+, 2022]
• Referring expression comprehension: TextVQA [Kant+, ECCV'20], MDETR [Kamath+, ICCV'21]
• Caption generation [Zhou+, CVPR'18][Li+, ICCV'19][Cornia+, CVPR'20]
  (example output: "He starts his motorbike. Then walks away.")
• Video retrieval [Gabeur+, ECCV'20]
Pose estimation
• Hand pose [Huang+, ECCV'20]
• Sign-language recognition [Saunders+, ECCV'20]
Segmentation
• Panoptic Segmentation [Kirillov+, CVPR 2019][Kirillov+, ECCV'20]
• Works for both Semantic Segmentation and Instance Segmentation [Zhang+, ECCV'20]
• Note: there are so many kinds of "xxxx Segmentation"???
Video understanding
• Trajectory prediction [Yu+, ECCV'20]
• Action recognition [Girdhar+, ECCV'20]
Other examples
• Image generation models: super-resolution + completion [Parmar+, ICML'18],
  super-resolution [Yang+, CVPR'20]
• Instance-level retrieval: fine-grained recognition [Kim+, ECCV'20],
  sketch-based retrieval [Ribeiro+, CVPR'20]
• Object detection (DETR): detection (left) and Panoptic Segmentation (right)
  [Carion+, ECCV'20]; detection also proposed in [Chi+, NeurIPS'20]
Vision Transformer [Dosovitskiy+, ICLR'21]
• Training on images with a Transformer alone gives:
  – roughly the accuracy of a 152-layer ResNet at about 25% of the training time
  – but requires the huge JFT-300M dataset
  – DeiT, which adds knowledge distillation and beats EfficientNet while training
    only on the 1000-class ImageNet data, is also well known [Touvron, ICML'21]
• An encoder-only Transformer (patch-to-token step sketched below)
  – Actually simpler in structure, and easier to understand, than the original Transformer
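The only image-specific machinery in ViT is turning pixels into tokens. A minimal sketch of that patch-embedding front end (function name mine):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Each patch becomes one "token": a vector of length patch*patch*C,
    which a learned linear layer would project to the model dimension.
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, C)
    return x.reshape(-1, patch * patch * C)        # (num_patches, p*p*C)

tokens = patchify(np.zeros((224, 224, 3)), patch=16)   # (196, 768): ViT-Base sizes
```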
MobileViT [Mehta+Rastegari, ICLR'22]
• In the central block:
  – Unfold each patch horizontally (Unfold)
  – Apply a Transformer to each set of vertically sliced vectors
  – Put differently: on sets of N d-dimensional vectors, apply the Transformer P = wh times
  – Fold each patch back into a square (Fold)
MV2: MobileNet v2 block; ↓2: downsampling
MobileViT [Mehta+Rastegari, ICLR'22]
• Trained only on the 1000-class ImageNet data:
  – more accurate than the other mobile networks
  – fewer parameters and higher accuracy than the other ViT-style networks
Doing object detection or segmentation with a Vision Transformer
• With only coarse patches like ViT's, precise bounding-box localization and
  segmentation accuracy suffer [Dosovitskiy+, ICLR'21]
• Yet creating many fine patches makes the attention computation slow
  (quadratic in the number of patches)
Swin Transformer [Liu+, ICCV'21] — Best paper!
• Brings in the pyramid structure familiar from CNNs
  – Restricts attention to local windows, reducing computation
  – Alternates these layers with layers whose window partition is slightly shifted
    → achieves a receptive field beyond the local window
  – Evaluated on object detection (for segmentation, see Swin-Unet [Cao+, 2021])
Structural comparison of Vision Transformer and CNN [Raghu+, NeurIPS'21]
• Internal representation structure of ViT vs. CNN
  – ViT's representations are more uniform: lower and upper layers are highly similar
  – The two models differ markedly
• Use of local vs. global information
  – ViT pulls global information into lower layers more than ResNet does
  – But ingesting local information in the lower layers remains important,
    which is why large-scale pretraining data is needed
• Network structure
  – ViT's skip connections are even more influential than ResNets',
    strongly affecting performance and representation similarity
• Positional information
  – ViT preserves within-image position information better
Is the Transformer obsolete?!
MLP-Mixer [Tolstikhin+, NeurIPS'21]
gMLP [Liu+, NeurIPS'21]
In gMLP's image-classification comparison, gMLP is more accurate and/or uses fewer
parameters than almost every CNN classifier, than the Transformer classifiers,
and than the other MLP-style models —
which invites the reaction: isn't the Transformer obsolete?!
To cut to the conclusion:
Transformer, MLP-Mixer, and gMLP
are essentially doing much the same thing.
It is simply that attention was not all you need.
Transformer [Vaswani+, NIPS'17]
Vision Transformer [Dosovitskiy+, ICLR'21]
In short: repeat the dotted block —
[Figure: normalize → transform the vectors jointly (attention) →
normalize → transform the vectors individually (MLP)]
MLP-Mixer [Tolstikhin+, NeurIPS'21]
In short: the very same loop —
[Figure: normalize → transform the vectors jointly (token-mixing MLP) →
normalize → transform the vectors individually]
gMLP [Liu+, NeurIPS'21]
In short: again the same loop —
[Figure: normalize → transform the vectors individually → mix the vectors →
transform the vectors individually again]
What's the same vs. what changed
• What they all share
  – Repeatedly apply modules that transform a set of vectors
  – The transforms come in exactly two kinds: mix the vectors together,
    or transform each vector individually (sketched below)
  – The per-vector transform is an MLP
  – Normalization (Layer Normalization) to prevent vanishing/exploding gradients
  – Skip connections, introduced for the same purpose
• What changed from the Transformer
  – The mixing transform went from attention to a matrix product
  – Positional information is carried by the vectors themselves (Transformer)
    → or held by the network, tied to each vector's index (MLP-style)
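A minimal sketch of this shared skeleton, with the token mixer as a swappable function (norm placement and learned gains are simplifications; the MLP-Mixer-style mixer shown is only one choice):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    mu = X.mean(-1, keepdims=True)
    return (X - mu) / np.sqrt(X.var(-1, keepdims=True) + eps)

def block(X, mix, W1, W2):
    """The shared skeleton: normalize -> mix the vectors (+ residual),
    then normalize -> per-vector MLP (+ residual)."""
    X = X + mix(layer_norm(X))                    # "transform vectors jointly"
    H = np.maximum(layer_norm(X) @ W1, 0.0)       # "transform vectors individually"
    return X + H @ W2

n, d = 12, 16
rng = np.random.default_rng(0)
M = rng.normal(size=(n, n))        # MLP-Mixer-style mixer: a learned matrix over
mixer = lambda X: M @ X            # the token axis; position lives in M's indices
# A Transformer would put attention here instead (see the earlier sketch),
# computing the mixing weights from the vectors themselves.
Y = block(rng.normal(size=(n, d)), mixer,
          rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
```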
A report from within the gMLP paper itself:
"the network is almost entirely MLP, but a variant with just a little
attention added (aMLP) performs even better."
And after saying this sort of thing since around 2021...
MetaFormer [Yu+, CVPR'22]
• The difference between MLP-like models and the Transformer is only the Token Mixer
  → exactly the "transform vectors by mixing them" part
• Wouldn't Pooling do fine as the mixer?
• Trained on the 1000-class ImageNet data, it is more accurate than the
  ViT-style and MLP-style methods
And again...
HyperMixer [Mai+, 2022]
• The difference between MLP-Mixer and the Transformer is only the Token Mixing
  → again the "transform vectors by mixing them" part
• Builds HyperMixer, whose token mixing is position-invariant, unlike MLP-Mixer's
• Achieves good accuracy across NLP tasks
To state the conclusion once more:
Transformer, MLP-Mixer, and gMLP
are essentially doing much the same thing.
It is simply that attention was not all you need.
Transformer know-how
The stalwarts who set out to raise the Transformer's basic performance
• ELU [Clevert+, ICLR 2016]
• GeLU [Hendrycks+Gimpel, 2016]
• Swish [Ramachandran+, ICLR WS 2018]
• SELU [Klambauer+, NIPS 2017]
• GLU [Dauphin+, ICML 2017]
• RMS [Zhang+Sennrich, NeurIPS 2019]
• ReZero [Bachlechner+, 2020]
• Fixup [Zhang+, ICLR 2019]
• Adaptive Softmax [Joulin+, ICML 2017]
• Mixture of Softmaxes [Yang+, ICLR 2018]
• Transparent Attention [Bapna+, EMNLP 2018]
• Evolved Transformer [So+, ICML 2019]
• Synthesizer variants [Tay+, 2020]
• Funnel Transformer [Dai+, NeurIPS 2020]
• Lightweight and Dynamic convolution [Wu+, ICLR 2019]
• Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021]
• Switch Transformer [Fedus+, 2021]
• Product Key Memory [Lample+, NeurIPS 2019]
• Universal Transformer [Dehghani+, ICLR 2019]
...and the additions continue inexhaustibly in recent years?!
About the source of this critique
16 people inside Google Research evaluated
over 30 Transformer modification methods,
in over 50 variations,
from 9 perspectives —
and found that the bulk of them differ little from the original Transformer.
A taxonomy of the modifications
1. Activation functions
2. Normalization
3. Depth
4. Embeddings
5. Parameter sharing
6. Softmax improvements
7. Overall architecture
※ Careful: not every method here was published with the intent of improving the Transformer
1. A history of activation-function improvements
• ELU [Clevert+, ICLR 2016]
  – Only the negative part changes: an exponential asymptoting to -1; basically a smooth shape.
  – Over 3000 citations!
• GeLU [Hendrycks+Gimpel, 2016]
  – Shaped like ReLU, but with a smooth derivative.
  – Used in both BERT and GPT-3, yet rejected from ICLR. Pity.
• Swish [Ramachandran+, ICLR WS 2018]
  – Shaped like ReLU, but with a smooth derivative. Wait, that is exactly what we said about GeLU.
• SELU [Klambauer+, NIPS 2017]
  – Part of the paper proposing Self-Normalizing Neural Networks: ELU with constant
    scaling on both the positive and the negative parts.
  – SELU itself was never evaluated in isolation. The reviewers even pointed this out;
    how did it get accepted?
• GLU [Dauphin+, ICML 2017]
  – An activation function that borrows the gate from LSTMs.
  – The elementwise product of a gate with values in 0–1 (a linear transform passed
    through a sigmoid) and a separately computed linear transform.
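For concreteness, minimal NumPy sketches of the activations above (GeLU via its common tanh approximation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, alpha=1.0):
    # Negative part: an exponential asymptoting to -alpha; smooth overall.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # ELU with constant scaling on both parts [Klambauer+, NIPS 2017].
    return lam * elu(x, alpha)

def gelu(x):
    # Common tanh approximation of x * Phi(x) [Hendrycks+Gimpel, 2016].
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    # x * sigmoid(x): indeed hard to distinguish from GeLU by eye.
    return x * sigmoid(x)

def glu(x, W, b, V, c):
    # Gate in (0, 1) from a linear map + sigmoid, times a second linear map.
    return sigmoid(x @ W + b) * (x @ V + c)
```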
In short...
[Figure: plots of the activation shapes — ReLU; ELU/SELU, whose negative part dips;
GeLU/Swish, roughly ReLU but smooth]
And to be blunt...
none of them raises accuracy much.
First, the experimental setup
• Transfer-learning tasks: Text-to-Text Transfer Transformer (T5)
  – Pretraining for multiple text-in/text-out tasks
  – Uses the Colossal Clean Crawled Corpus (C4): 6.1TB of text filtered down to about 750GB
  – The paper adopts three downstream tasks:
    • Composite tasks such as QA and inference: SuperGLUE [Wang+, NeurIPS 2019]
    • Summarization: XSum [Narayan+, EMNLP 2018]
    • Question answering: WebQuestions [Berant+, EMNLP 2013]
• Machine translation: the WMT'14 English–German task
Caveat: all experiments in this paper are NLP tasks.
Results for activation functions
• Settings adjusted so parameter counts and compute match
  – Final loss = language-modeling performance; only here, lower is better
  – SGLUE = composite tasks; XSum = summarization; WebQ = question answering;
    WMT EnDe = machine translation
• Result: no method improves performance across the board
  – Improvements are supposedly marked in bold, but the bolding is often wrong, so beware
Some activation functions did improve accuracy
• Descendants of GLU [Shazeer, 2020]
  – GLU is the elementwise product of a gate (linear transform + activation) and a linear transform
  – GLU itself used a sigmoid as the activation
• The variations tried:
  – GeGLU: the activation is GeLU (the smooth ReLU-like one)
  – ReGLU: the activation is ReLU
  – SwiGLU: the activation is Swish (the other smooth ReLU-like one)
    • also used in PaLM, the 540B-parameter giant language model [Chowdhery+, 2022]
  – LiGLU: no activation (a bilinear form)
  – Wait, no combinations with ELU or SELU...?
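A sketch of the GLU family in the FFN form of [Shazeer, 2020], with bias terms omitted; the activation is the only thing that changes between variants:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_variant(x, W, V, activation):
    """GLU family: activation(xW) * (xV), elementwise (biases omitted)."""
    return activation(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 16))
V = rng.normal(size=(8, 16))

out_glu    = glu_variant(x, W, V, sigmoid)                        # original GLU
out_reglu  = glu_variant(x, W, V, lambda z: np.maximum(z, 0.0))   # ReGLU
out_swiglu = glu_variant(x, W, V, lambda z: z * sigmoid(z))       # SwiGLU (PaLM)
out_geglu  = glu_variant(x, W, V, lambda z: 0.5 * z * (1 + np.tanh(
    np.sqrt(2 / np.pi) * (z + 0.044715 * z**3))))                 # GeGLU (tanh approx.)
out_liglu  = glu_variant(x, W, V, lambda z: z)                    # LiGLU: bilinear
```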
Evaluating the GLU variations
• Protocol as before
  – Final loss = language-modeling performance; only here, lower is better
  – SGLUE = composite tasks; XSum = summarization; WebQ = question answering;
    WMT EnDe = machine translation
• GLU and LiGLU dip slightly on some tasks, but...
  the other variations are consistently effective
2. A history of normalization improvements
• Vanilla Transformer: LayerNorm [NIPS DLS 2016]
  – Normalizes each vector by the mean and variance over its elements (see sketch below).
• RMS [Zhang+Sennrich, NeurIPS 2019]
  – LayerNorm is slow, so normalize without subtracting the mean.
• ReZero [Bachlechner+, 2020]
  – The title opens with "ReZero is All You Need"... here we go,
    another "X is All You Need" paper.
  – Multiplies the transform branch of the original Transformer (left) by a
    learnable parameter α, initialized to zero (right).
• Fixup [Zhang+, ICLR 2019]
  – With no normalization at all, just setting residual-block initial values to
    0 or 1 achieves accuracy close to BatchNorm/LayerNorm.
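The LayerNorm-vs-RMSNorm difference in a few lines (learned gain/bias parameters omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each vector by the mean and variance over its elements.
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(((x - mu) ** 2).mean(-1, keepdims=True) + eps)

def rms_norm(x, eps=1e-6):
    # RMSNorm [Zhang+Sennrich, NeurIPS 2019]: skip the mean subtraction and
    # divide by the root mean square alone -- cheaper, similarly effective.
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
```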
Results for normalization
• The contenders:
  – RMS Norm: "we make LayerNorm's computation faster"
  – ReZero: "we multiply the residual block's transform by α (initialized to 0)"
  – Fixup: "get the initialization right and normalization becomes unnecessary"
• Which of them turned out to be effective?
The results:
• Only RMS Norm helps
  – or rather, ReZero is in serious trouble
3. Examining depth
• The Feed Forward part in the right figure:
  – linear transform #1 → ReLU → linear transform #2
  – We want the parameter-count trade-offs among:
    • the total number of layers (N in the right figure)
    • the dimension of the middle part, d_ff
    • the number of heads H in Multi-Head Attention
• Findings
  – Vanilla is 12 layers, d_ff = 3072, H = 12
  – Deeper stacks look slightly more accurate, but steps per second get slower.
4. Examining embedding schemes
• The Input and Output embeddings are vocabulary × embedding-dimension parameter matrices
  – In NLP these weigh heavily on the parameter count
• Matrix factorization (from ALBERT [Lan+, ICLR 2020])
  – vocabulary × embedding → (vocabulary × inner) and (inner × embedding)
• Embeddings at the encoder input/output [Chung+, ICLR 2021]
  – Shared (Tied) or not shared (Untied)
• Frequency-based embedding sizes [Baevski+Auli, ICLR 2019]
  – Low-frequency words get low-dimensional embeddings
• Result: sharing the decoder's input/output embeddings (unshared with the encoder) works well
5. Examining parameter sharing
• From ALBERT [Lan+, ICLR 2020]:
  – Share all parameters across layers
  – Also try it combined with the embedding factorization and sharing above
  – Or share within the encoder only / the decoder only
• Result: mostly bad
  – Caveat: ALBERT also adds a sentence-order loss, which this paper does not
    include, so this is not a comparison with ALBERT itself
6. Softmax
• Adaptive Softmax [Joulin+, ICML 2017]
  – Clusters the vocabulary by word frequency → hierarchical classification for speed
  – Low-frequency words get an extra projection: lighter and faster still
• Mixture of Softmaxes [Yang+, ICLR 2018]
  – Computes the softmax K ways and takes a weighted sum for the posterior
• Results:
  – Mixture of Softmaxes improves some tasks, but computation speed drops by 40%
7. A history of overall-architecture improvements
• Transparent Attention [Bapna+, EMNLP 2018]
• Evolved Transformer [So+, ICML 2019]
• Synthesizer variants [Tay+, 2020]
• Funnel Transformer [Dai+, NeurIPS 2020]
• Lightweight and Dynamic convolution [Wu+, ICLR 2019]
• Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021]
• Switch Transformer [Fedus+, 2021]
• Product Key Memory [Lample+, NeurIPS 2019]
• Universal Transformer [Dehghani+, ICLR 2019]
Experimental results: by now there are so many that it gets bewildering.
Which improved accuracy and which hurt it
(the same list, annotated in the slide; the notable ones are picked up next)
• Transparent Attention [Bapna+, EMNLP 2018]
• Evolved Transformer [So+, ICML 2019]
• Synthesizer variants [Tay+, 2020]
• Funnel Transformer [Dai+, NeurIPS 2020]
• Lightweight and Dynamic convolution [Wu+, ICLR 2019]
• Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021]
• Switch Transformer [Fedus+, 2021]
• Product Key Memory [Lample+, NeurIPS 2019]
• Universal Transformer [Dehghani+, ICLR 2019]
Product Key Memory and Synthesizer variants
• Product Key Memory [Lample+, NeurIPS 2019]
  – Proposes PKM, which does something multi-head-attention-like while learning
    a large bank of keys on the side
  – Replacing the FFN of a few layers with PKM yields better accuracy and speed
    from a smaller network!
  – Of the earlier list, only this paper and [Wu+, ICLR 2019] are from Facebook
    (everything else is Google-affiliated)
• Synthesizer [Tay+, 2020]
  – Replaces the attention matrix QK^T with:
    • a linear transform + ReLU + linear transform of the input X (Dense)
    • or just random numbers (Random)
  – Performer [Choromanski+, ICLR'21] instead approximates attention with
    kernels of q and k
Switch Transformer and Mixture of Experts Transformer
[Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021][Fedus+, 2021]
• Switch Transformer: 1.6 trillion parameters
  – Sounds enormous, but there are simply multiple FFNs (Mixture of Experts)
  – Switch Transformer selects just one of these FFNs per input (routing sketched below)
  – So the full parameter set is not used on every training or inference step
• Aside
  – This whole line of papers is from Google
  – Noam Shazeer in particular is
    • the second author of the original Transformer paper
    • and an author on these papers too
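A minimal sketch of the top-1 routing idea (the router's gating probability and the load-balancing loss of the actual Switch Transformer are omitted):

```python
import numpy as np

def switch_ffn(x, experts, Wr):
    """Top-1 expert routing: each token passes through exactly one expert FFN.

    x: (n, d) tokens; experts: list of (W1, W2) FFN weights; Wr: (d, E) router.
    Parameters can grow with the number of experts E while per-token compute
    stays constant, since only the chosen expert touches each token.
    """
    choice = (x @ Wr).argmax(-1)                      # (n,) chosen expert per token
    out = np.zeros_like(x)
    for e, (W1, W2) in enumerate(experts):
        sel = choice == e
        out[sel] = np.maximum(x[sel] @ W1, 0.0) @ W2  # ordinary FFN on its tokens
    return out

rng = np.random.default_rng(0)
n, d, E = 8, 16, 4
y = switch_ffn(rng.normal(size=(n, d)),
               [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
                for _ in range(E)],
               rng.normal(size=(d, E)))
```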
Summary
• The majority of recent Transformer modifications differ little from the original Transformer
  – They do not generalize across multiple tasks, even though the source code barely changes
"One possible explanation for this is that the originally proposed Transformer
architecture was near-perfect, and there wasn't much that could be done to
improve it." — the authors
• Their welcome advice for anyone devising a new modification:
  "base it on multiple implementations", "evaluate on multiple tasks, including CV",
  "keep hyperparameters aligned", "report mean + variance, not the best value"
Even so: methods whose effect was confirmed
• In fact, some tricks have made it into the "vanilla" Transformer:
  – LayerNorm after → LayerNorm first [Baevski+Auli 2019][Xiong+, 2020]
  – Absolute positional embeddings → relative positional embeddings [Raffel+, 2019]
• Activations: GLU combined with GeLU/Swish
• Normalization: RMS Norm
• Sharing the decoder's input/output embeddings
• Architectural devices:
  – Mixture of Experts Transformer
  – Switch Transformer
  – Product key memory
  – Synthesizer variants
Other Transformer know-how
The "LayerNorm after or before?" problem
• Post-LN: higher performance, but unstable training → the problem is gradients vanishing partway up the stack
• Pre-LN: stable training, but lower performance → the problem is inputs and outputs barely differing
• Bottom-to-Top (B2T) connection [Takase+, 2022]
  – Achieves both Post-LN's transformation capacity and better training stability than Pre-LN
• DeepNorm [Wang+, 2022]
  – Post-LN-based: instead of adding the residual branch as-is (×1), scale it up by a constant
  – And shrink (part of) the parameter initialization by dividing by a constant
  – Made even 1000-layer Transformers trainable
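The three residual arrangements side by side, as function skeletons (the DeepNorm line follows the update rule LN(αx + sublayer(x)); its companion initialization scaling is not shown):

```python
def post_ln(x, sublayer, ln):
    # Post-LN (original Transformer): normalize after the residual addition.
    # Strong, but gradients can vanish deep in the stack.
    return ln(x + sublayer(x))

def pre_ln(x, sublayer, ln):
    # Pre-LN: the residual path is untouched, so training is stable,
    # but input and output can end up barely differing.
    return x + sublayer(ln(x))

def deepnorm(x, sublayer, ln, alpha):
    # DeepNorm [Wang+, 2022]: Post-LN with the residual up-weighted by a
    # constant alpha > 1, paired with down-scaled initialization (not shown).
    return ln(alpha * x + sublayer(x))
```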
Positional Encoder improvements and warmup
Positional Encoder
• Originally: absolute-position embeddings
• Relative-position embeddings [Shaw+, NAACL'18][Raffel+, 2019]
• Shift invariance introduced into absolute positional embeddings [Kiyono+, EMNLP'21]
• RoPE (Rotary Position Embedding) [Su+, 2021]
Learning-rate warmup
• Initially the rate increases linearly with step_num
• Once step_num equals warmup_steps, it decays toward 0
  in inverse proportion to the square root of step_num
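This is the schedule from the original paper; a direct transcription:

```python
def transformer_lr(step_num: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """The schedule from [Vaswani+, NIPS'17]: linear warmup for warmup_steps,
    then decay proportional to 1 / sqrt(step_num)."""
    step_num = max(step_num, 1)
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)
```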
Remembering the power law
• The law observed with OpenAI's GPT-3 [Henighan+, 2020]
• When the compute budget, dataset size, or parameter count grows t-fold,
  the loss shrinks by a factor of a·t^b (a, b constants)
In other words...
whoever has the most resources wins
Remembering the power law
• With Google's PaLM...
• performance changed discontinuously as parameters grew 8B → 62B → 540B
In other words...
whoever has even more resources wins
A resignation that may set in
"So the strong ones are whoever can train a huge foundation model on a
large-scale dataset and then learn the downstream tasks..."
• Exploring the Limits of Large Scale Pre-training [Abnar+, ICLR'22]
  – Accuracy when JFT-300M is learned as the upstream task (horizontal axis)
  – vs. accuracy when each downstream task is learned afterwards (vertical axis)
  – Even as upstream accuracy keeps rising, downstream accuracy plateaus partway
In closing
We surveyed the Transformer from its basic operation through its range of
applications to the recent MLP-style networks:
• What is a Transformer anyway?
• The Transformer whirlwind
• Is the Transformer obsolete?!
• Transformer know-how
The trade-off between inductive bias and the required data and compute:
• transform vectors individually
• transform vectors by mixing them
→ attention is effective, but it is not everything
[Figure: connectivity of CNN, RNN, and Transformer/MLP-style models, as shown earlier]
(Recap) For anyone who has heard so much that nothing makes sense: a Transformer
just repeats a Token Mixer and a per-token MLP joined by residual connections;
the MLP-style models share the same structure — and beyond that, mind the know-how.