Hi everyone, today I will present the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale."
------------------------------ Next Slide ------------------------------
Today's outline: a brief introduction to the paper, why this paper is important, a recap of transformers in natural language processing, how vision transformers work, a comparison of the proposed method with state-of-the-art models, transfer and few-shot evaluation on ImageNet, performance versus cost for different architectures, inspecting vision transformers (attention maps and so on), and finally the main takeaways.
------------------------------ Next Slide ------------------------------
The paper was released by Google and gained significant attention from the deep learning community. For example, this repository, which is not even the official one, has collected 1.2k stars and more than one hundred forks in three months, and Karpathy shared a nice tweet about the paper. It was accepted to ICLR 2021 with three accept reviews, averaging seven points. The model does not contain any convolutional neural networks, and it achieves state-of-the-art results with fewer computational resources.
------------------------------ Next Slide ------------------------------
Why does this paper matter?
First, it demonstrates state-of-the-art accuracy with less training computation. As a detailed example, the authors state that it decreases the training time by eighty percent compared with the Noisy Student model.
As I mentioned, the model requires less computation because it contains only fully connected layers; in other words, these models do not use convolutional neural networks.
How can they achieve these results without CNNs? The core mechanism behind the transformer architecture is self-attention, which gives the model the capability to understand the connections between its inputs (a minimal sketch follows at the end of this slide).
The third point is the efficacy of the transformer with small patches. The paper finds that the model is able to encode the distance between patches in the similarity of their position embeddings, and ViT integrates information across the entire image even in the lowest layers of the transformer.
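To make the second point concrete, here is a minimal sketch of single-head scaled dot-product self-attention in plain NumPy. The function name, the random projection matrices, and the token shapes are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, dim) token embeddings; w_*: (dim, dim) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project tokens to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise similarity between all tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ v                              # each token becomes a weighted mix of all tokens

# Example: 5 "patch" tokens of dimension 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)              # shape (5, 8)
```

Every token can attend to every other token in a single step, which is how the model relates distant parts of the input.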
------------------------------ Next Slide ------------------------------
In NLP, people use transformer models for almost any purpose, such as machine translation, text classification, text generation, named entity recognition, and so on. The NLP transformer has two parts: the first is the encoder and the second is the decoder. The encoder encodes the inputs, also using positional encoding. Please note that in NLP we use sine and cosine positional encoding; if you are interested, please look at the more detailed papers. In the decoder part, we use the encoder output together with the decoder's own computed inputs. For example, in machine translation, if we want to translate English text into French, we feed the encoder with the English words and the decoder with the French words during training. The encoder encodes the meanings of the English words, and the decoder applies French grammar to those meanings.
For the vision transformer, they propose to use only the encoder part, and they work with the encoder's output.
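Since I keep mentioning the sine and cosine positional encoding, here is a small NumPy sketch of the fixed sinusoidal encoding used in the original NLP transformer; the function name and sizes are my own illustrative choices, and as we will see next, ViT replaces this with learnable position embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """Return a (num_positions, dim) matrix of fixed sine/cosine encodings."""
    positions = np.arange(num_positions)[:, None]                  # (P, 1)
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # one frequency per pair of dims
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(positions * freqs)   # even dimensions get sines
    pe[:, 1::2] = np.cos(positions * freqs)   # odd dimensions get cosines
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, dim=64)      # added to the token embeddings
```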
------------------------------ Next Slide ------------------------------
Let's continue with the architecture of the vision transformer.
First, split the image into a grid of patches. We can decide the patch size, for example sixteen by sixteen, as mentioned in the paper's title. The patches then play the role of words for the model. The model applies positional encoding as in the NLP application, but with a different approach: as I mentioned, in NLP people use sines and cosines, while here the authors use learnable parameters for the positional encoding. The embeddings are then fed to the encoder modules. As you can see, the model puts an extra learnable embedding in front of the patch embeddings; the reason is that they want to predict the image class from it, an approach that is also commonly used in NLP. After encoding the patch features, they put an MLP head on top for classification. In short, the model architecture is similar to the NLP one, and it is easy to understand and reuse for a different problem.
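To tie these steps together, here is a rough PyTorch sketch of a tiny ViT-style forward pass, assuming a recent PyTorch: split the image into 16x16 patches, linearly embed them, prepend a learnable class token, add learnable position embeddings, run a standard transformer encoder, and classify from the class token. The TinyViT name and the hyperparameters are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_size = patch_size
        self.patch_embed = nn.Linear(3 * patch_size * patch_size, dim)       # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                # extra learnable class embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)                              # MLP head for classification

    def forward(self, images):                                   # images: (B, 3, H, W)
        p = self.patch_size
        b, c, h, w = images.shape
        # split into non-overlapping p x p patches and flatten each one
        patches = images.unfold(2, p, p).unfold(3, p, p)         # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        tokens = self.patch_embed(patches)                       # (B, N, dim)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                          # classify from the class token

logits = TinyViT()(torch.randn(2, 3, 224, 224))   # (2, 1000)
```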
------------------------------ Next Slide ------------------------------
They reach state-of-the-art results on all datasets, even though they do not use convolutional neural networks. The model is pre-trained on the JFT-300M dataset and outperforms the ResNet-based baselines, while taking substantially fewer computational resources to pre-train. We can also compare the TPU-core-days: as seen in the last row, they used relatively less computational power than the others.
------------------------------ Next Slide ------------------------------
The strength of the method shows up when a large pre-training dataset is used. As seen with the last dataset, JFT-300M, the vision transformer gives the best results; this dataset is large and not publicly available.
The same situation holds for the few-shot evaluation: ResNet-based networks give better results in the few-shot setting, but once we pre-train on large datasets, the vision transformer gives better results.
------------------------------ Next Slide ------------------------------
In this slide, "hybrid" means we use ResNet embeddings for each patch. Previously, the model took the raw patches and computed the embeddings with a fully connected layer. As expected, the vision transformer with the ResNet backbone is close to the original vision transformer; moreover, the hybrid one gives better results for the small models.
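As a rough illustration of this hybrid idea, the patch embeddings can be taken from the feature map of a CNN backbone instead of raw pixel patches. The sketch below uses torchvision's resnet50 truncated before its pooling and classification layers, with a 1x1 convolution mapping the channels to the transformer width; the paper's hybrid uses a modified ResNet, so the exact stage and widths differ.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = nn.Sequential(*list(resnet50().children())[:-2])  # keep only the convolutional stages
proj = nn.Conv2d(2048, 192, kernel_size=1)                   # map CNN channels to the transformer width

images = torch.randn(2, 3, 224, 224)
feat = backbone(images)                                      # (2, 2048, 7, 7) feature map
tokens = proj(feat).flatten(2).transpose(1, 2)               # (2, 49, 192): one token per spatial position
# `tokens` then replaces the raw-patch embeddings fed to the transformer encoder.
```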
------------------------------ Next Slide ------------------------------
They also investigated the inner layers of the model. In the left figure, we can see the high-level features learned by the linear embedding layer; they look similar to CNN filters. In the center, they show the learned positional encodings for the patches, and you can see that the model has learned the patch locations with the learnable layer. In the next figure, they show the relation between the attention heads and the attention distance across layers; we observe that the 16 heads attend to different locations in the image, so it is a representative model. Finally, in the right figure, we can see the attention map for each prediction. In the second row, there is a plane, and the model correctly predicted the plane class; when they inspect the attention map of the model, they observe that it highlights the plane, not the background.
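For these attention maps, one common way to aggregate the per-layer attention matrices into a single map over the input is attention rollout. The NumPy sketch below is only a rough illustration: it assumes the attention matrices have already been averaged over heads, and the fifty-fifty mixing with the identity (to account for residual connections) is a simplifying choice.

```python
import numpy as np

def attention_rollout(attentions):
    """attentions: list of per-layer (num_tokens, num_tokens) matrices, head-averaged."""
    num_tokens = attentions[0].shape[0]
    rollout = np.eye(num_tokens)
    for attn in attentions:
        attn = 0.5 * attn + 0.5 * np.eye(num_tokens)    # account for the residual connections
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalize the rows
        rollout = attn @ rollout                        # propagate attention through the layers
    return rollout[0, 1:]                               # class-token attention to each patch

# Example: 12 layers of random attention over 1 class token + 196 patches
rng = np.random.default_rng(0)
layers = [rng.dirichlet(np.ones(197), size=197) for _ in range(12)]
patch_attribution = attention_rollout(layers)           # shape (196,)
```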
------------------------------ Next Slide ------------------------------
In conclusion, there are other transformers for vision in the literature, but the main strength of this paper is that it is patch-based, so it reduces the computational cost significantly compared not only with pixel-based transformers but also with CNN-based models.
Pre-training affects the results substantially; however, it may come with a high compute cost, and the authors state that they spent a lot of TPU power.
This work may take over from traditional convolutional neural networks in the near future.
Some potential follow-up ideas are object segmentation and detection.
The official code is publicly available; you can reach it from the link.
------------------------------ Next Slide ------------------------------
Thanks for listening!