The Frontier of Transformers: Beyond Convolutional Neural Networks
Yoshitaka Ushiku (OMRON SINIC X), @losnuevetoros
SSII 2022

About the speaker
• 2013.6-2013.8 Microsoft Research, Intern: image caption generation [Ushiku+, ACMMM 2012]
• 2014.3 Ph.D. (Information Science and Technology), The University of Tokyo
• 2014.4-2016.3 NTT Communication Science Laboratories, Researcher: cross-retrieval of images and captions [Ushiku+, ICCV 2015][Yamaguchi+, ICCV 2017], e.g., "A yellow train on the tracks." / "A guy is skiing with no shirt on and yellow snow pants."
• 2016.4-2018.9 The University of Tokyo, Lecturer (Harada-Ushiku Lab)
• 2016.9- AIST, Cooperative Researcher; 2016.12-2018.9 NINJAL, Collaborative Researcher
• 2018.10- OMRON SINIC X, Principal Investigator
• 2019.1- Ridge-i, Director and Chief Research Officer
• 2020.4- Tsuda University, Part-time Lecturer; 2021.7- Tohoku University, Part-time Lecturer
• Member of ACM, IEEE, and several Japanese academic societies (IEICE, IPSJ, RSJ, JSAI, and others)

About this talk
A bird's-eye view from the basics of the Transformer to its application domains and the recent MLP-family networks:
• So, what is a Transformer anyway?
• The Transformer whirlwind
• Is the Transformer obsolete?!
• Transformer know-how
(The latest version of these slides is also posted online.)

For those who feel there are too many buzzwords to keep up with:
• A Transformer just takes tokens (words, image patches, and so on) and repeats two modules, each wrapped in a residual connection:
  – a Token Mixer that mixes information across tokens
  – an MLP that transforms each token
• In essence, repeat: normalize → transform the vectors together; normalize → transform each vector separately.
• The recently talked-about MLP-family models share essentially the same structure!
• …and beyond that there is plenty of know-how, so stay alert.

So, what is a Transformer anyway?

The architecture figure you see in every Transformer-family paper, with or without annotations, looks easy to read but is actually hard to follow (personal opinion). So let us strip it down.

A Transformer is a network that transforms an arbitrary number of vectors into an arbitrary number of vectors.
• Input: an arbitrary number of vectors
• Output: an arbitrary number of vectors
• (for now, ignore everything else in the diagram)

Positional Encoding?? Positional Encoder??
A Positional Encoder produces vectors that represent the positions of the input and output vectors:
• for word embeddings: "which word of the sentence is this?"
• for image features: "which coordinates in the image?"
• for time-series features: "information from which time step?"
[Vaswani+, NIPS'17] encodes word positions as vectors of sines and cosines (the wave marks in the figure stand for sine/cosine; a sketch follows below).

RNNs and CNNs, by contrast, already know where their features sit:
• "Both RNNs and CNNs know the relative positions of elements" [Shaw+, NAACL-HLT'18][Dai+, ACL'19]
• "CNNs even learn the absolute positions of elements" [Islam+, ICLR'20], demonstrated with image saliency: whether the whole image or a crop is fed in, "the center of the image" comes out with high saliency either way.
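To make the sine/cosine scheme of [Vaswani+, NIPS'17] concrete, here is a minimal sketch assuming PyTorch (the function name `sinusoidal_positions` is mine, not from the talk or the paper):

```python
import torch

def sinusoidal_positions(num_tokens: int, dim: int) -> torch.Tensor:
    """Positional encodings from [Vaswani+, NIPS'17]:
    PE(pos, 2i)   = sin(pos / 10000^(2i/dim))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))
    """
    assert dim % 2 == 0
    pos = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(1)  # (n, 1)
    i = torch.arange(0, dim, 2, dtype=torch.float32)                  # (dim/2,)
    angles = pos / torch.pow(10000.0, i / dim)                        # (n, dim/2)
    pe = torch.zeros(num_tokens, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # added to the token embeddings before the first layer

# e.g., 12 word embeddings of dimension 512:
# x = x + sinusoidal_positions(12, 512)
```

Each position thus gets a unique, smoothly varying vector, which is simply added to the corresponding token embedding.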
Autoregressive vs. non-autoregressive: how are the output vectors produced?
• Autoregressive = one vector at a time
  – the (i+1)-th output vector is produced from the set of input vectors plus the i vectors output so far
  – repeat until the word meaning "end" comes out
  – the original Transformer [Vaswani+, NIPS'17] is autoregressive
  – accuracy is good, but generation is slow
• Non-autoregressive = everything at once
  – some scheme feeds one "seed" vector per output position [Gu+, ICLR'18][Guo+, AAAI'19]
  – everything is computed in one parallel pass, so it is fast, but accuracy drops
  – repeating the one-shot generation to adjust the number of output words recovers some accuracy [Ghazvininejad+, EMNLP'19][Gu+, NeurIPS'19]

That much should be fine so far. Now for the parts of the diagram we have been ignoring. Each block contains:
• a residual connection, and
• an MLP applied to each vector (equivalent to a 1x1 convolution),
so the remaining mystery is the Multi-Head Attention box, which in turn is built from Scaled Dot-Product Attention.

Multi-head attention: recall that what we handle is an arbitrary number of vectors
• in NLP, the sequence of word embeddings plus positional information, e.g., "Ougai, you look like the cat that ate the canary."
• in image recognition, a width x height grid of local feature vectors (one per location, with as many dimensions as channels) plus the vertical/horizontal positional information

Now focus on one vector x_i and update it by the following steps:
1. Compute the query with which x_i will "search" the other vectors: q_i = W^Q x_i.
2. Compute a key for every vector x_j, as the target of that search: k_j = W^K x_j.
3. Compute the "similarity" between the query and every key, scaled by the dimensionality d and normalized with a softmax: a_ij = softmax_j(q_i . k_j / sqrt(d)).
4. Compute a value, the representative content of every vector: v_j = W^V x_j.
5. Compute the similarity-weighted sum sum_j a_ij v_j and add it to the vector being updated, x_i (there is the residual connection).

Running steps 1-5 for every vector is exactly what the box labeled Scaled Dot-Product Attention denotes. Multi-Head Attention then prepares h triples of matrices (W^Q_1, W^K_1, W^V_1), …, (W^Q_h, W^K_h, W^V_h) and runs steps 1-5 for all vectors with each triple, as sketched below.
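Here are the five steps in code, batched over all vectors at once: a minimal sketch of multi-head scaled dot-product attention, assuming PyTorch. It follows the formulas above; the class name is mine, and the output projection W^O of the original paper is omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Steps 1-5 for all vectors at once: softmax(Q K^T / sqrt(d)) V."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d = heads, dim // heads
        self.W_q = nn.Linear(dim, dim, bias=False)  # step 1: queries
        self.W_k = nn.Linear(dim, dim, bias=False)  # step 2: keys
        self.W_v = nn.Linear(dim, dim, bias=False)  # step 4: values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, dim = x.shape                         # n vectors of size dim
        def split(t):                            # (n, dim) -> (heads, n, d)
            return t.view(n, self.heads, self.d).transpose(0, 1)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # step 3: scaled similarities, softmax over the searched vectors
        a = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        out = (a @ v).transpose(0, 1).reshape(n, dim)
        return x + out                           # step 5: weighted sum + residual

# x = torch.randn(12, 512)              # 12 tokens, 512 dimensions
# y = MultiHeadSelfAttention(512, 8)(x) # same shape as x
```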
Transformer, in summary
• A technique that transforms an arbitrary number of vectors in an encoder-decoder style, repeating the block N times
• For example, GPT-3 uses 96 repetitions and 96-head attention (128 dimensions per head) → 175B parameters
• Note: PaLM, which Google released in April 2022, has 540B [Chowdhery+, 2022]

How Transformers differ from (attention-free) CNNs and RNNs
• Transformer: computes attention from every vector to every vector. O(n^2), but information can be gathered over a wide range without loss
• CNN: convolves a handful of vectors at a time (three, say, which is tiny relative to the whole input) while scanning across it. O(n), but information propagates only between neighbors
• RNN: scans the vectors one by one while storing the history (memory) in its cell. O(n), but long sequences are hard to handle

The Transformer whirlwind

Outside computer vision
• Natural language processing
  – machine translation, i.e., the original paper [Vaswani+, NIPS'17], and countless follow-ups!!!
  – language models: Google's BERT [Devlin+, NAACL-HLT'19] and PaLM [Chowdhery+, 2022], DeepMind's Gopher [Rae+, 2022], OpenAI's GPT-2/3 [Radford+, 2019][Brown+, NeurIPS'20], and many more
  – retrieval over a database of 2 trillion tokens [Borgeaud+, 2021]
• Speech and signal processing
  – representation learning: HuBERT [Hsu+, TASLP'21], SSAST [Gong+, AAAI'22]
  – speech recognition [Lüscher+, INTERSPEECH'19]
  – music generation [Huang+, ICLR'19][Choi+, ICML'20]
  – time series [Li+, NeurIPS'19][Wu+, NeurIPS'20]
• Tabular data
  – FT-Transformer [Gorishniy+, NeurIPS'21]; note that for tabular data Gradient Boosting remains strong
• Bio/Chem-informatics
  – molecular structure analysis [Fuchs+, NeurIPS'20][Rong+, NeurIPS'20]
• Agents and robotics
  – multi-agent communication [Inala+, NeurIPS'20]
  – one-shot imitation learning [Dasari+Gupta, CoRL'20]
  – reinforcement learning over task sequences: Scene Memory Transformer [Fang+, CVPR'19], Decision Transformer [Chen+, NeurIPS'21], Trajectory Transformer [Janner+, NeurIPS'21], Gato [Reed+, 2022]

Vision & Language
• Representation learning → foundation models: VideoBERT [Sun+, ICCV'19], LXMERT [Tan+Bansal, EMNLP'19], ViLBERT [Lu+, NeurIPS'19], VL-BERT [Su+, ICLR'20], UNITER [Chen+, ECCV'20], OSCAR [Li+, ECCV'20], Voken [Tan+Bansal, EMNLP'20], COOT [Ging+, NeurIPS'20], Perceiver [Jaegle+, ICML'21], PolyViT [Likhosherstov+, 2021], Flamingo [Alayrac+, 2022]
• Scene-text understanding: TextVQA [Kant+, ECCV'20]
• Caption generation [Zhou+, CVPR'18][Li+, ICCV'19][Cornia+, CVPR'20], e.g., "He starts his motorbike. Then walks away."
• Retrieval [Gabeur+, ECCV'20]; MDETR [Kamath+, ICCV'21]

Vision applications
• Pose estimation: hand pose [Huang+, ECCV'20]; sign-language recognition [Saunders+, ECCV'20]
• Region segmentation: Panoptic Segmentation [Kirillov+, ECCV'20]; a single model covering both Semantic and Instance Segmentation [Zhang+, ECCV'20] (note: there are an awful lot of "xxxx Segmentation" tasks [Kirillov+, CVPR 2019])
• Video understanding: trajectory prediction [Yu+, ECCV'20], action recognition [Girdhar+, ECCV'20]
• Image generation: super-resolution and completion [Parmar+, ICML'18], super-resolution [Yang+, CVPR'20]
• Instance retrieval: fine-grained identification [Kim+, ECCV'20], sketch-based retrieval [Ribeiro+, CVPR'20]
• Object detection: DETR, handling detection and Panoptic Segmentation [Carion+, ECCV'20]; detection is also proposed in [Chi+, NeurIPS'20]

Vision Transformer [Dosovitskiy+, ICLR'21]
• Training on images with a Transformer alone
  – reaches roughly the accuracy of a 152-layer ResNet at about 25% of the training time
  – but requires the huge JFT-300M dataset
  – combined with knowledge distillation, DeiT surpasses EfficientNet even when trained only on the 1000-class ImageNet data [Touvron+, ICML'21]
• An encoder-only Transformer
  – structurally simpler than the original Transformer, and easier to understand

MobileViT [Mehta+Rastegari, ICLR'22]
• In its central block
  – patches are unfolded side by side (Unfold)
  – a Transformer is applied to the vertically sliced sets of vectors, P = wh of them
  – in other words, a Transformer applied to P-element sets of d-dimensional vectors
  – the patches are then folded back into squares (Fold)
  – (MV2: MobileNet v2 block; ↓2: downsampling)
• Trained only on the 1000-class ImageNet data
  – higher accuracy than existing mobile networks
  – higher accuracy with fewer parameters than existing ViT-family networks

Object detection and segmentation with Vision Transformers
• With coarse patches like ViT's, bounding-box localization and segmentation accuracy drop [Dosovitskiy+, ICLR'21]
• On the other hand, making the patches fine blows up the attention computation (quadratic in the number of patches)

Swin Transformer [Liu+, ICCV'21] (ICCV 2021 Best Paper!)
• Adopts the pyramid structure familiar from CNNs
  – attention is computed within local windows, cutting the computation
  – those layers are interleaved with layers whose window partition is shifted → a receptive field beyond the local windows
  – evaluated on object detection (for segmentation there is Swin-Unet [Cao+, 2021])

Vision Transformer and CNN, structurally compared [Raghu+, NeurIPS'21]
• Representations: ViT keeps more uniform representations, with high similarity between lower and upper layers; the two model families differ markedly
• Local vs. global information: ViT takes in more global information than ResNet even in its lower layers; still, taking in local information in the lower layers remains important, which is why large-scale pre-training data is needed
• Network structure: ViT's skip connections are even more influential than ResNet's, strongly affecting both performance and representation similarity
• Positional information: ViT preserves the location information of the image better

Is the Transformer obsolete?!

MLP-Mixer [Tolstikhin+, NeurIPS'21] and gMLP [Liu+, NeurIPS'21]
• In gMLP's image-classification comparison, gMLP has better accuracy and/or fewer parameters than almost every CNN classifier, every Transformer classifier (and the other MLP-family models)
• Hence the fantasy that starts to surface: is the Transformer obsolete?!

To state the conclusion first: Transformer, MLP-Mixer, and gMLP are doing essentially the same thing. It simply turned out that attention was not all you need.

What each model boils down to
• Transformer [Vaswani+, NIPS'17] / Vision Transformer [Dosovitskiy+, ICLR'21]: repeat (normalize → transform the vectors together; normalize → transform each vector separately)
• MLP-Mixer [Tolstikhin+, NeurIPS'21]: the same repetition
• gMLP [Liu+, NeurIPS'21]: repeat (transform each vector separately → normalize → transform the vectors together → transform each vector separately)

What is shared vs. what changed (see the code sketch after this list)
• Shared by all of them
  – repeatedly apply modules that transform a set of vectors
  – each transformation either mixes the vectors together or transforms each vector separately
  – the per-vector transformation is an MLP
  – normalization (Layer Normalization) against vanishing and exploding gradients
  – skip connections, introduced for the same purpose
• What changed from the Transformer
  – the mixing of vectors is done by a plain weight matrix (an MLP) instead of attention
  – positional information: in the Transformer each vector carries its own position; in the MLP-family the network holds it, tied to each vector's index
• A report from the gMLP paper: "the network is almost all MLP, but a version with just a little attention added (aMLP) performs even better"
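The shared skeleton can be written down directly. Below is a minimal sketch (my own illustration, not code from any of these papers) of the common block, normalize → mix the vectors together → normalize → transform each vector separately, using an MLP-Mixer-style token mixer; swapping `token_mix` for the self-attention module sketched earlier would give a (Pre-LN) Transformer block:

```python
import torch
import torch.nn as nn

class MixerStyleBlock(nn.Module):
    """Shared skeleton: x + TokenMix(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, num_tokens: int, dim: int, hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixer: here a learned matrix over the token axis
        # (MLP-Mixer style); self-attention here yields a Transformer.
        self.token_mix = nn.Linear(num_tokens, num_tokens, bias=False)
        self.norm2 = nn.LayerNorm(dim)
        # Per-token MLP, identical for every token (= 1x1 convolution).
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (n, dim)
        # mix the vectors *together*: operate along the token axis
        h = self.token_mix(self.norm1(x).transpose(0, 1)).transpose(0, 1)
        x = x + h
        # transform each vector *separately*: operate along the channel axis
        x = x + self.mlp(self.norm2(x))
        return x

# tokens = torch.randn(196, 512)               # e.g., 14x14 patches, 512 dims
# out = MixerStyleBlock(196, 512, 2048)(tokens)
```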
MetaFormer [Yu+, CVPR'22]
• The difference between MLP-like models and the Transformer is only the Token Mixer part → exactly the "transform the vectors together" module above
• Then wouldn't even plain pooling do as the mixer?
• Trained on the 1000-class ImageNet data, it beats the ViT-family and MLP-family methods in accuracy

HyperMixer [Mai+, 2022]
• The difference between MLP-Mixer and the Transformer is only the token mixing → again, exactly "transform the vectors together"
• HyperMixer performs position-invariant token mixing, unlike MLP-Mixer
• It achieves good accuracy on natural language processing tasks

To state the conclusion once more: Transformer, MLP-Mixer, and gMLP are doing essentially the same thing. It simply turned out that attention was not all you need.

Transformer know-how

Modifications targeting the Transformer's basic performance keep appearing, year after year, without end:
• ELU [Clevert+, ICLR 2016], GeLU [Hendrycks+Gimpel, 2016], Swish [Ramachandran+, ICLR WS 2018], SELU [Klambauer+, NIPS 2017], GLU [Dauphin+, ICML 2017]
• RMS Norm [Zhang+Sennrich, NeurIPS 2019], ReZero [Bachlechner+, 2020], Fixup [Zhang+, ICLR 2019]
• Adaptive Softmax [Joulin+, ICML 2017], Mixture of Softmaxes [Yang+, ICLR 2018]
• Transparent Attention [Bapna+, EMNLP 2018], Evolved Transformer [So+, ICML 2019], Synthesizer variants [Tay+, 2020], Funnel Transformer [Dai+, NeurIPS 2020], Lightweight and Dynamic convolution [Wu+, ICLR 2019], Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021], Switch Transformer [Fedus+, 2021], Product Key Memory [Lample+, NeurIPS 2019], Universal Transformer [Dehghani+, ICLR 2019]

The survey that called their bluff
• Sixteen people inside Google Research took more than 30 Transformer improvements, in more than 50 variations, and evaluated them from nine viewpoints
• The punchline: the vast majority of the methods showed no significant difference from the original Transformer
• The improvements fall into seven categories:
  1. activation functions
  2. normalization
  3. depth
  4. embeddings
  5. parameter sharing
  6. softmax improvements
  7. the overall architecture
• Note that not all of these were proposed with Transformer improvement in mind

1. A history of activation-function improvements
• ELU [Clevert+, ICLR 2016]
  – exponential in the negative part only, saturating at -1; basically a smoothed shape. Over 3000 citations!
• GeLU [Hendrycks+Gimpel, 2016]
  – like ReLU, but the function is smooth. Used in both BERT and GPT-3, yet rejected from ICLR. Poor thing.
• Swish [Ramachandran+, ICLR WS 2018]
  – like ReLU, but the function is smooth. Wait, that description is indistinguishable from GeLU's.
• SELU [Klambauer+, NIPS 2017]
  – from the paper proposing Self-Normalizing Neural Networks; an ELU multiplied by constants in both the positive and the negative part. SELU by itself was never evaluated; the reviewers even pointed this out, so how did it get through?
• GLU [Dauphin+, ICML 2017]
  – an activation borrowed from the LSTM's gates: the elementwise product of a gate in [0, 1] (a linear transformation passed through a sigmoid) and a separately computed linear transformation
• In short: ELU/SELU pull down the negative part; GeLU/Swish are roughly "ReLU, but smooth". And frankly, none of them raises accuracy all that much.

A note before the results: the experimental setup
• Transfer-learning task: Text-to-Text Transfer Transformer (T5)
  – pre-training on multiple tasks that map text to text
  – uses the Colossal Clean Crawled Corpus (C4), 6.1TB of text filtered down to roughly 750GB
  – the paper adopts three downstream tasks:
    • composite tasks such as QA: SuperGLUE [Wang+, NeurIPS 2019]
    • document summarization: XSum [Narayan+, EMNLP 2018]
    • question answering: WebQuestions [Berant+, EMNLP 2013]
• Machine-translation task: the WMT'14 English-to-German translation task
• Fair warning: every experiment in this paper is an NLP task

Results for activation functions
• Parameter counts and computation are adjusted to match across methods
  – Final loss = language-model quality; for this column only, lower is better
  – SGLUE = the composite tasks, XSum = summarization, WebQ = question answering, WMT EnDe = machine translation
• The common thread: no method consistently improves performance
  – the best value per column is in bold, but the winner keeps changing from column to column, so beware

Still, some activation functions did improve accuracy: the GLU extensions [Shazeer, 2020]
• GLU is the elementwise product of a gate (transformation + activation) and a separate transformation; the original GLU used a sigmoid as the gate's activation
• The paper tries the following variations (a code sketch follows):
  – GeGLU: the gate activation is GeLU (the smooth ReLU-like one)
  – ReGLU: the gate activation is ReLU
  – SwiGLU: the gate activation is Swish (the other smooth ReLU-like one)
    • also used in PaLM, the giant language model with 540 billion parameters [Chowdhery+, 2022]
  – LiGLU: no activation (bilinear)
  – (one wonders: why not also try combining with ELU or SELU…?)
• Evaluation of the GLU variants, same protocol as above
  – GLU and LiGLU occasionally score slightly lower, but…
  – the GeGLU/ReGLU/SwiGLU variants are consistently effective
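As an illustration, here is a minimal sketch of a GLU-variant feed-forward layer, assuming PyTorch; [Shazeer, 2020] defines the variants as formulas, and the class name and layer sizes here are mine:

```python
import torch
import torch.nn as nn

class SwiGLUFeedForward(nn.Module):
    """FFN(x) = W2 (Swish(W x) * (V x)), the SwiGLU variant of GLU.
    Replacing Swish with sigmoid gives the original GLU; with GeLU,
    GeGLU; with ReLU, ReGLU; with the identity, LiGLU (bilinear)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.lin = nn.Linear(dim, hidden, bias=False)   # value branch
        self.out = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x) = x * sigmoid(x), known in PyTorch as SiLU
        return self.out(nn.functional.silu(self.gate(x)) * self.lin(x))

# y = SwiGLUFeedForward(512, 2048)(torch.randn(12, 512))
```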
2. A history of normalization improvements
• The vanilla Transformer: LayerNorm [NIPS DLS 2016]
  – normalizes each vector by the mean and variance of its elements
• RMS Norm [Zhang+Sennrich, NeurIPS 2019]
  – LayerNorm's mean computation is slow, so drop the subtraction and normalize by the root mean square alone
• ReZero [Bachlechner+, 2020]
  – the title pun is "ReZero is All You Need": yet another "xx is All You Need" paper
  – multiplies the transformed branch of the original Transformer by a learnable scalar alpha initialized to zero
• Fixup [Zhang+, ICLR 2019]
  – uses no normalization at all: merely initializing the residual blocks to 0 or 1 reaches accuracy on par with BatchNorm or LayerNorm

Results for normalization
• The contenders: RMS Norm ("makes LayerNorm's computation faster"), ReZero ("initialize the transformed branch's alpha to 0"), Fixup ("initialize properly and normalization becomes unnecessary")
• Which of them actually helped?
  – only RMS Norm is effective
  – if anything, ReZero is in serious trouble

3. Examining depth
• The Feed Forward part on the right of the diagram: transformation 1 → ReLU → transformation 2
• The parameter trade-off is examined over
  – the total number of layers (N in the diagram)
  – the dimensionality d_ff of the middle layer
  – the number H of Multi-Head Attention heads
• Findings
  – vanilla: 12 layers, d_ff = 3072, H = 12
  – deeper settings tend to be more accurate, but the computation per training step is slower

4. Examining embedding methods
• Input and Output hold (vocabulary size x embedding dimension) parameters
  – in NLP this has a large impact on the parameter count
• Matrix factorization (following ALBERT [Lan+, ICLR 2020])
  – vocabulary x embedding → (vocabulary x inner dimension) times (inner dimension x embedding)
• Tying the embeddings at the encoder's input and output [Chung+, ICLR 2021]
  – shared (Tied) or not shared (Untied)
• Adjusting embedding dimensions by frequency [Baevski+Auli, ICLR 2019]
  – rare words are embedded into lower-dimensional vectors
• Result: sharing the decoder's input and output embeddings (while keeping them separate from the encoder's) works well

5. Examining parameter sharing
• Following ALBERT [Lan+, ICLR 2020]
  – share the parameters of all layers
  – also combine with the factorized embeddings above
  – share within the encoder only / within the decoder only
• Result: mostly a loss
  – caveat: ALBERT also adds a sentence-order loss, which this paper leaves out, so this is not a comparison with ALBERT itself

6. Softmax improvements
• Adaptive Softmax [Joulin+, ICML 2017]
  – clusters the vocabulary by word frequency → speedup through hierarchical classification
  – rare words are additionally projected to lower dimensions: lighter and faster
• Mixture of Softmaxes [Yang+, ICLR 2018]
  – computes the softmax K times and estimates the posterior by a weighted sum
• Result
  – Mixture of Softmaxes improves performance on some tasks, but computation speed drops by 40%

7. A history of overall-architecture improvements
• Transparent Attention [Bapna+, EMNLP 2018], Evolved Transformer [So+, ICML 2019], Synthesizer variants [Tay+, 2020], Funnel Transformer [Dai+, NeurIPS 2020], Lightweight and Dynamic convolution [Wu+, ICLR 2019], Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021], Switch Transformer [Fedus+, 2021], Product Key Memory [Lample+, NeurIPS 2019], Universal Transformer [Dehghani+, ICLR 2019]
• At last, the main event: which of these got better, and which got worse?

Product Key Memory and the Synthesizer variants
• Product Key Memory [Lample+, NeurIPS 2019]
  – does something close to multi-head attention while separately learning a large number of keys
  – replacing the FFN of a few layers with PKM yields a smaller network with both accuracy and speed up!
  – an aside: within this survey, only this paper and [Wu+, ICLR 2019] are Facebook papers (the rest are all Google-affiliated)
• Synthesizer [Tay+, 2020]
  – builds the attention matrix A not from Q K^T but
    • from the input X via transformation + ReLU + transformation (Dense), or
    • from random constants, which turn out to be good enough (Random)
  – (Performer [Choromanski+, ICLR'21], by contrast, approximates attention with kernels)

Switch Transformer and the Mixture of Experts Transformer [Shazeer+, NeurIPS 2018][Lepikhin+, ICLR 2021][Fedus+, 2021]
• Switch Transformer: 1.6 trillion parameters
  – sounds enormous, but the model simply contains many parallel FFNs (Mixture of Experts)
  – Switch Transformer selects exactly one of these FFNs per input
  – so it does not use all parameters in every training or inference step
• This line of work is all from Google
  – in particular, Noam Shazeer is the second author of the original Transformer paper and an author of these papers as well

Survey wrap-up
• The vast majority of recent Transformer improvements end up not much different from the original Transformer
  – they lack robustness across multiple tasks, and their source code is mostly not available
• "One possible explanation for this is that the originally-proposed Transformer architecture was near-perfect, and there wasn't much that could be done to improve it."
• The authors' precious advice for whenever you devise a new improvement:
  – "build on multiple code bases", "evaluate on multiple tasks, including CV", "keep the hyperparameters fixed", "report mean and variance, not just the best value"
• Even so, some methods had confirmed benefits. In fact, a couple of tweaks are already part of today's "vanilla" Transformer:
  – LayerNorm moved from after to before each block [Baevski+Auli, ICLR 2019][Xiong+, 2020]
  – absolute position embeddings replaced by relative ones [Raffel+, 2019]
• Confirmed effective in this survey:
  – activation functions: GLU combined with GeLU/Swish
  – normalization: RMS Norm
  – sharing the decoder's input/output embeddings
  – architectural devices: Mixture of Experts Transformer, Switch Transformer, Product Key Memory, Synthesizer variants

More Transformer know-how

The thorny question of where to put Layer Normalization
• Post-LN: higher final performance, but unstable training → the problem is that gradients vanish partway through the stack
• Pre-LN: stable training, but lower performance → the problem is that input and output barely change
• Bottom-to-Top (B2T) connection [Takase+, 2022]
  – achieves both Post-LN's transformation capacity and training stability beyond Pre-LN's
• DeepNorm [Wang+, 2022]
  – Post-LN based; the residual connection, normally added as-is (x1), is multiplied by a larger constant
  – the parameter initialization is divided by a constant to make it smaller
  – 1000-layer Transformers became trainable this way

Improving the Positional Encoder, and warmup
• Positional encoders
  – originally, embeddings of absolute positions
  – position embeddings based on relative positions [Shaw+, NAACL'18][Raffel+, 2019]
  – shift-invariance introduced into absolute position embeddings [Kiyono+, EMNLP'21]
  – RoPE (Rotary Position Embedding) [Su+, 2021]
• Learning-rate warmup (sketched below)
  – at first the rate increases linearly with num_step
  – once num_step reaches warmup_step, it decays toward 0 in inverse proportion to the square root of num_step
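Written out, the warmup schedule of [Vaswani+, NIPS'17] looks as follows (a minimal sketch; the function name is mine):

```python
def transformer_lr(num_step: int, warmup_step: int, dim: int = 512) -> float:
    """Linear warmup, then decay proportional to 1/sqrt(num_step),
    as in [Vaswani+, NIPS'17]: lr = d^-0.5 * min(n^-0.5, n * w^-1.5).
    The two branches meet exactly at num_step == warmup_step."""
    num_step = max(num_step, 1)  # avoid 0 ** -0.5 at step zero
    return dim ** -0.5 * min(num_step ** -0.5,
                             num_step * warmup_step ** -1.5)

# The rate rises linearly for the first warmup_step steps, then decays:
# [transformer_lr(s, 4000) for s in (1, 4000, 16000)]
```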
Memories of power laws
• The law observed by OpenAI on GPT-3 [Henighan+, 2020]
  – as the compute, the dataset size, or the parameter count t grows, the loss shrinks following a power law a * t^b (a and b constants)
  – in other words: whoever has more resources wins
• Memories beyond the power law
  – on Google's PaLM, capability jumped discontinuously as the parameters grew 8B → 62B → 540B
  – in other words: pour in even more resources and abilities may emerge that we have not yet seen

"So training a colossal foundation model on a huge dataset and then fine-tuning it downstream is unbeatable…?"
• Exploring the Limits of Large Scale Pre-training [Abnar+, ICLR'22]
  – accuracy when JFT-300M is trained as the upstream task (horizontal axis) vs. accuracy after continuing with a downstream task (vertical axis)
  – even as the upstream accuracy keeps rising, the downstream accuracy saturates partway

In closing

This talk took a bird's-eye view from the basics of the Transformer to its application domains and the recent MLP-family networks:
• So, what is a Transformer anyway?
• The Transformer whirlwind
• Is the Transformer obsolete?!
• Transformer know-how
Keep in mind the trade-off between inductive bias and the data and compute required:
• CNNs and RNNs build locality in; Transformer/MLP-family models instead transform each vector separately and mix the vectors together → attention is effective, but it is not everything

Once more, for those who feel there are too many buzzwords to keep up with:
• A Transformer just takes tokens (words, image patches, and so on) and repeats two modules, each wrapped in a residual connection:
  – a Token Mixer that mixes information across tokens
  – an MLP that transforms each token
  (in essence, repeat: normalize → transform the vectors together; normalize → transform each vector separately)
• The recently talked-about MLP-family models share essentially the same structure!
• …and beyond that there is plenty of know-how, so stay alert.