Hi everyone, today I will present the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale."

------------------------------ Next Slide ------------------------------

Today's outline: a brief introduction to the paper, why this paper is important, a recap of transformers in natural language processing, how vision transformers work, a comparison of the proposed method with state-of-the-art models, transfer and few-shot evaluation on ImageNet, performance versus cost for different architectures, inspecting vision transformers (attention maps and so on), and finally the main takeaways.

------------------------------ Next Slide ------------------------------

The paper was released by Google and has gained significant attention from the deep learning community. For example, an unofficial repository has collected 1.2k stars and more than one hundred forks in three months, and Karpathy shared a nice tweet about the paper. It was accepted to ICLR 2021 with three accept reviews, averaging seven points. The model does not contain any convolutional layers, and it achieves state-of-the-art results with fewer computational resources.

------------------------------ Next Slide ------------------------------

Why does this paper matter? First, it demonstrates state-of-the-art accuracy with less training compute: as a concrete example, the authors state that it reduces training time by about eighty percent compared to the Noisy Student model. As I mentioned, the model requires less computation because it is built from fully connected layers, without any convolutions. How can they achieve these results without CNNs? The core mechanism behind the transformer architecture is self-attention, which gives the model the ability to capture the relations between its inputs. The third point is the efficacy of the transformer with small patches: the paper shows that the model encodes the distance between patches in the similarity of their position embeddings, and that ViT integrates information across the entire image even in the lowest layers.

------------------------------ Next Slide ------------------------------

In NLP, transformer models are used for many purposes, such as machine translation, text classification, text generation, named entity recognition, and so on. The NLP transformer has two parts: the first is the encoder and the second is the decoder. The encoder encodes the inputs together with a positional encoding; note that in NLP the original transformer uses sine and cosine positional encodings (if you are interested, please look at the original papers for the details). The decoder uses the encoder output together with its own inputs. For example, in machine translation from English to French, we feed the encoder with the English words and the decoder with the French words during training: the encoder captures the meaning of the English words, and the decoder turns that meaning into a grammatical French sentence. The vision transformer keeps only the encoder part and works directly with the encoder output.

------------------------------ Next Slide ------------------------------

Let's continue with the architecture of the vision transformer. First, the image is split into a grid of patches; we choose the patch size, for example 16x16, as in the paper title. The patches then play the role of words for the model, as sketched below.
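
To make this patch-splitting step concrete, here is a minimal NumPy sketch (my own illustration, not the authors' code; the function name and shapes are arbitrary) that turns an image into a sequence of flattened patch tokens. In the model, each flattened patch is then mapped to the embedding dimension with a learned linear projection.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches and flatten
    each patch into a vector, so the patches become the model's 'words'."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, then flatten each patch.
    x = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)                        # (H/P, W/P, P, P, C)
    return x.reshape(-1, patch_size * patch_size * c)     # (num_patches, P*P*C)

# A 224x224 RGB image with 16x16 patches gives 14*14 = 196 tokens of size 768.
tokens = image_to_patch_tokens(np.random.rand(224, 224, 3))
print(tokens.shape)  # (196, 768)
```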
The model applies a positional encoding, just as in the NLP setting, but with a different approach: as I mentioned, in NLP people typically use sines and cosines, whereas here the authors use learnable parameters for the positional encoding. The embeddings are then fed to the encoder modules. As you can see, the model prepends an extra learnable embedding to the patch embeddings; the reason is that this token is used to predict the image class, an approach that is also common in NLP. After encoding the patch features, an MLP head performs the classification. In short, the architecture closely mirrors the NLP transformer, so it is easy to understand and to reuse for a different problem.

------------------------------ Next Slide ------------------------------

They reached state-of-the-art results on all datasets, even though they do not use convolutional neural networks. The model was pre-trained on the JFT-300M dataset and outperforms the ResNet-based baselines, while taking substantially fewer computational resources to pre-train. We can also compare the TPU-core-days in the last row: they used noticeably less computational power than the others.

------------------------------ Next Slide ------------------------------

The strength of the method shows up when a large pre-training dataset is used. As seen for the last dataset, JFT-300M, the vision transformer gives the best results; this dataset is large and not publicly available. The few-shot evaluation shows the same trend: ResNet-based networks give better results with smaller pre-training datasets, but with large-scale pre-training the vision transformer gives better results.

------------------------------ Next Slide ------------------------------

In this slide, "hybrid" means that ResNet features are used as the embedding for each patch, whereas previously the model took the raw patch and computed the embedding with a fully connected layer. As expected, the vision transformer with the ResNet stem is close to the original vision transformer; moreover, the hybrid variant gives better results for the smaller models.

------------------------------ Next Slide ------------------------------

They also inspected the inner workings of the model. In the left figure, we can see the filters learned by the linear embedding layer, which look similar to CNN features. In the center, they show the learned positional embeddings for the patches, and you can see that the model learned the patch locations with the learnable layer. In the last figure, they show the relation between the attention heads and the attention distance: the 16 heads attend to different locations in the image, so the model builds a rich representation. In the right figure, we can see the attention map for each prediction. In the second row there is a plane, and the model correctly predicted the class "plane"; when they inspect the attention map, it highlights the plane rather than the background.

------------------------------ Next Slide ------------------------------

In conclusion, there are already some transformers for vision in the literature, but the main strength of this paper is that it is patch-based, which reduces the computational cost significantly compared not only to pixel-based transformers but also to CNN-based models. Pre-training affects the results substantially; however, it comes with a high compute cost, and the authors state that they spent a lot of TPU power. This work may take over from traditional convolutional neural networks in the near future.
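
Before the final remarks, here is a minimal end-to-end sketch of the pipeline described above (a PyTorch illustration of my own, not the authors' implementation; the class name and the small hyperparameter values are arbitrary): flattened patches, a linear projection, a learnable class token and positional embeddings, a transformer encoder, and an MLP head on the class token.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + learnable positional
    embedding + class token + transformer encoder + MLP head."""
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        self.to_embedding = nn.Linear(3 * patch_size * patch_size, dim)  # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # extra learnable "class" embedding
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mlp_head = nn.Linear(dim, num_classes)                      # classification head

    def forward(self, images):                              # images: (B, 3, H, W)
        b, c, _, _ = images.shape
        p = self.patch_size
        # Split into non-overlapping patches and flatten each one.
        x = images.unfold(2, p, p).unfold(3, p, p)          # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        x = self.to_embedding(x)                            # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embedding  # prepend class token, add positions
        x = self.encoder(x)
        return self.mlp_head(x[:, 0])                       # predict from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```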
Some potential future directions are object detection and segmentation. The official code is publicly available; you can reach it from the link.

------------------------------ Next Slide ------------------------------

Thanks for listening!