Convolution-like ViT architectures (ConViT)
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases
• ConViT combines the benefits of convolutional architectures with the benefits of vision transformer self-attention, capturing long-range global context while maintaining locality and inductive bias.
What are the inductive biases in CNNs?
• Inductive bias in a CNN refers to the set of assumptions, constraints, or biases inherent in the design of the network architecture. These biases are intentionally built in to facilitate learning from the input data and to guide the model toward solutions that are likely to generalize well to unseen data. Here are some of the inductive biases in CNNs (a minimal code sketch follows the list):
o Local Connectivity: Convolutional architectures are based on the idea of local connectivity, which assumes that neighboring input units in the
data are likely to have spatial correlations. By exploiting this assumption, convolutional layers apply filters (kernels) across small regions of the
input data, enabling the model to capture spatial patterns effectively. This reduces parameters and focuses learning on spatial patterns.
o Parameter Sharing: The same filter (set of weights) is applied across the entire image, assuming similar features can appear at different locations. This drastically reduces parameters and yields translation equivariance (a shifted input produces a correspondingly shifted feature map), which, together with pooling, gives the network a degree of translation invariance (recognizing objects regardless of position).
o Hierarchical Feature Learning: Stacked convolutional layers progressively extract higher-level features. Lower layers capture edges and
textures, while higher layers learn complex object parts and concepts.
o Pooling Operations: Pooling layers shrink the spatial dimensions while retaining key information. This reduces sensitivity to small variations and lets the network focus on larger structures: even if an object shifts slightly within the image, the pooled features remain largely unchanged, aiding accurate recognition (translation invariance).
o Implicit Scale and Orientation Invariance: While not explicitly enforced, the hierarchical nature of CNNs encourages features that are robust
to variations in scale and orientation.
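To make local connectivity and parameter sharing concrete, here is a minimal PyTorch sketch (PyTorch is our illustration choice, not something the slides prescribe) comparing the parameter count of a 3×3 convolution with a fully connected layer over the same input, and checking the translation equivariance that parameter sharing buys:

```python
import torch
import torch.nn as nn

# Local connectivity + parameter sharing: one small 3x3 filter bank is
# reused at every spatial location of a 3-channel 32x32 image.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# A fully connected layer mapping the same flattened input to an output
# of the same size learns a separate weight for every (input, output) pair.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)

print(sum(p.numel() for p in conv.parameters()))  # 448
print(sum(p.numel() for p in fc.parameters()))    # 50,348,032 (~112,000x more)

# Translation equivariance: shifting the input shifts the conv output.
x = torch.randn(1, 3, 32, 32)
y = conv(x)
y_shifted = conv(torch.roll(x, shifts=2, dims=3))  # input moved 2 px right
# Away from the borders (where padding and the circular roll differ),
# the shifted output matches the output of the shifted input.
print(torch.allclose(torch.roll(y, 2, dims=3)[..., 4:-4],
                     y_shifted[..., 4:-4], atol=1e-5))  # True
```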
Background
Positional Self-Attention (PSA) and GPSA
• Positional self-attention (PSA) is a technique used in
transformer-based models to incorporate relative position
information into the self-attention mechanism. This is in
contrast to the original transformer architecture, which relied
on adding absolute position embeddings to the input.
• The key idea behind PSA is to modify the self-attention computation to include an additional term that encodes the relative position between the query and key elements. In the formulation used by the ConViT paper (with r_{ij} the relative position encoding of patches i and j and v_pos a learned embedding), the attention becomes:

A_{ij} = \mathrm{softmax}\left( Q_i^{\top} K_j + v_{\mathrm{pos}}^{\top} r_{ij} \right)

• By including the relative position term, the attention mechanism can learn to attend more strongly to patches that are spatially close, which is particularly useful for pixel-wise regression tasks such as keypoint estimation and segmentation.
• The ConViT architecture extends PSA by introducing a "gated" version (GPSA), in which the relative position term is combined with the standard content-based self-attention term using a learned gate:

A_{ij}^{h} = \left(1 - \sigma(\lambda_h)\right) \mathrm{softmax}\left( Q_i^{h\top} K_j^{h} \right) + \sigma(\lambda_h)\, \mathrm{softmax}\left( v_{\mathrm{pos}}^{h\top} r_{ij} \right)

• where σ(λ_h) is a learned sigmoid gate (one scalar λ_h per attention head h) that controls the balance between the standard content-based self-attention and the positional self-attention.
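A minimal single-head PyTorch sketch of the gated attention above (the class name GPSAHead, the locality_strength handling, and the r_ij parameterization are illustrative simplifications, not the authors' reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSAHead(nn.Module):
    """Single-head gated positional self-attention (illustrative sketch)."""

    def __init__(self, dim: int, grid_size: int, locality_strength: float = 1.0):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5
        # Gate parameter lambda, initialized so the head starts mostly
        # positional (local), loosely mirroring the paper's initialization.
        self.gate = nn.Parameter(torch.ones(1))

        # Relative position encodings r_ij = (dx, dy, dx^2 + dy^2) between
        # all pairs of patches on a grid_size x grid_size grid.
        coords = torch.stack(torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        ), dim=-1).reshape(-1, 2).float()                           # (N, 2)
        rel = coords[None, :, :] - coords[:, None, :]               # (N, N, 2)
        r = torch.cat([rel, (rel ** 2).sum(-1, keepdim=True)], -1)  # (N, N, 3)
        self.register_buffer("rel_pos", r)

        # v_pos initialized so positional logits are -alpha * distance^2:
        # each query attends to its own neighborhood. (The paper additionally
        # shifts each head toward a different offset to mimic a conv kernel.)
        self.v_pos = nn.Parameter(torch.tensor([0.0, 0.0, -locality_strength]))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim), with N == grid_size ** 2 patch tokens
        q, k, v = self.q(x), self.k(x), self.v(x)
        content = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        positional = F.softmax(self.rel_pos @ self.v_pos, dim=-1)   # (N, N)
        g = torch.sigmoid(self.gate)
        attn = (1 - g) * content + g * positional   # the gated combination
        return attn @ v

# Usage: a 7x7 grid of 64-dimensional patch embeddings.
layer = GPSAHead(dim=64, grid_size=7)
out = layer(torch.randn(2, 49, 64))
print(out.shape)  # torch.Size([2, 49, 64])
```

Because each row of both softmax terms sums to one, the gated mixture is itself a valid attention distribution; training can then push σ(λ_h) toward 0 for heads that benefit more from content-based attention.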
Adaptive Attention Span and Positional Gating
• Adaptive attention span refers to the model's capability to dynamically adjust the range over which it attends to the input tokens based on the context of the data. This allows the model to focus on relevant information across varying distances, enabling it to capture both local and global dependencies efficiently (a sketch of the underlying soft-masking mechanism follows this list).
• Positional gating involves regulating the attention paid by
different heads within the model to position versus content
information. By adjusting a gating parameter, each
attention head can control the balance between focusing
on positional details and content features. This mechanism
allows certain heads to prioritize spatial relationships, while
others concentrate more on the content of the input.
• In ConViT, these mechanisms are integrated to leverage the benefits of both convolutional layers and self-attention. The model combines the locality-preserving characteristics of convolutions with the flexibility of self-attention, offering improved sample efficiency and performance on tasks like ImageNet classification. By incorporating adaptive attention span and positional gating, ConViT demonstrates how these techniques can enhance the learning capabilities of Vision Transformers, making them more effective at capturing complex patterns in visual data.
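The adaptive attention span idea comes from Sukhbaatar et al., "Adaptive Attention Span in Transformers"; below is a minimal sketch of that paper's soft ramp mask (the names and the 1-D setup are illustrative), which is what makes the span differentiable and hence learnable per head:

```python
import torch

def adaptive_span_mask(dist, span, ramp=32.0):
    """Soft mask m_z(d) = clamp((ramp + span - d) / ramp, 0, 1).

    Positions within `span` of the query get weight ~1, weights decay
    linearly over the next `ramp` positions, and anything farther away
    is masked out. `span` enters differentiably, so each head can learn
    how far it needs to look.
    """
    return torch.clamp((ramp + span - dist) / ramp, min=0.0, max=1.0)

# Pairwise distances between 8 positions in a 1-D token sequence.
pos = torch.arange(8).float()
dist = (pos[None, :] - pos[:, None]).abs()

span = torch.tensor(2.0, requires_grad=True)  # learnable span z
mask = adaptive_span_mask(dist, span, ramp=2.0)

# The mask multiplies the attention weights, which are re-normalized,
# so distant tokens contribute nothing for heads with small spans.
weights = torch.softmax(torch.randn(8, 8), dim=-1) * mask
weights = weights / weights.sum(dim=-1, keepdim=True)
print(weights.shape)  # torch.Size([8, 8])
```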
Investigating the Role of Locality
The ConViT paper investigates the role of locality in the following ways:
Quantifying Locality in Vanilla Self-Attention Layers:
o The paper first examines how locality is naturally encouraged in standard self-attention (SA) layers of Transformers.
o It shows that during the early stages of training, the non-locality metric (a measure of how much the model attends to distant tokens; see the sketch at the end of this section) decreases, indicating that the model becomes more "convolutional" and focuses on local information.
o However, in the later stages of training, the non-locality metric starts to increase again, as the upper layers capture long-range dependencies.
Analyzing Locality Escape in GPSA Layers:
o The ConViT model introduces Gated Positional Self-Attention (GPSA) layers, which are initialized to mimic the locality of convolutional layers.
o The paper then examines how the GPSA layers escape this initial locality constraint during training.
o It shows that the non-locality metric in the GPSA layers increases throughout training, indicating that the model is able to learn to capture both
local and global dependencies.
o The paper also analyzes the dynamics of the gating parameters in the GPSA layers, which control the balance between attending to position
information and content information.
Investigating the Impact of Locality Strength:
o The paper performs ablation studies to understand the effects of the locality strength (α) and the number of GPSA layers on the performance of
the ConViT model.
o It finds that increasing both the locality strength and the number of GPSA layers leads to improved performance on the ImageNet dataset,
particularly in the early stages of training.
In summary, the ConViT paper thoroughly investigates the role of locality in Transformer-based models, demonstrating how the GPSA layers can
effectively combine the strengths of convolutional and self-attention architectures by leveraging soft convolutional inductive biases.
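As a concrete illustration of the non-locality metric discussed above, here is one way to compute an attention-weighted mean distance from an attention map (a sketch of the idea, not the paper's exact code):

```python
import torch

def nonlocality(attn: torch.Tensor, grid_size: int) -> torch.Tensor:
    """Attention-weighted mean distance between query and key patches.

    attn: (N, N) row-stochastic attention map, N == grid_size ** 2.
    Low values mean local ("convolutional") attention; high values
    mean long-range attention.
    """
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
    ), dim=-1).reshape(-1, 2).float()       # (N, 2) patch coordinates
    dist = torch.cdist(coords, coords)      # (N, N) pairwise distances
    return (attn * dist).sum(dim=-1).mean()

# A perfectly local map (each patch attends only to itself) scores 0;
# a uniform map scores the mean pairwise patch distance.
N = 49
print(nonlocality(torch.eye(N), grid_size=7))                 # tensor(0.)
print(nonlocality(torch.full((N, N), 1.0 / N), grid_size=7))  # > 0
```

Tracking this quantity per layer over training reproduces the trends described above: it first drops as attention localizes, then rises again as the upper layers pick up long-range dependencies.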