Convolution-like ViT architectures (ConViT)
ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases

• ConViT combines the benefits of convolutional architectures with those of vision-transformer self-attention, capturing long-range global context while preserving the locality and inductive biases of convolutions.

What are the inductive biases in CNNs?
• Inductive bias in a CNN refers to the set of assumptions and constraints built into the network architecture itself. These biases are intentionally designed in to guide learning from the input data and steer the model toward solutions that are likely to generalize well to unseen data. Some of the main inductive biases in CNNs:
  o Local Connectivity: Convolutional architectures assume that neighboring input units are spatially correlated. Convolutional layers exploit this by applying filters (kernels) over small regions of the input, capturing spatial patterns effectively while reducing the number of parameters.
  o Parameter Sharing: The same filter (set of weights) is applied across the entire image, on the assumption that the same features can appear at different locations. This drastically reduces parameters and yields translation equivariance (a pattern is detected regardless of its position).
  o Hierarchical Feature Learning: Stacked convolutional layers progressively extract higher-level features: lower layers capture edges and textures, while higher layers learn complex object parts and concepts.
  o Pooling Operations: Pooling layers shrink the spatial resolution while retaining the key information, reducing sensitivity to small variations and letting the network focus on larger structures. Even if an object shifts slightly, the pooled features remain largely unchanged, giving approximate translation invariance.
  o Implicit Scale and Orientation Invariance: While not explicitly enforced, the hierarchical structure of CNNs encourages features that are robust to variations in scale and orientation.

Background: Positional Self-Attention (PSA) and GPSA
• Positional self-attention (PSA) incorporates relative position information directly into the self-attention mechanism, in contrast to the original transformer architecture, which added absolute position embeddings to the input.
• The key idea behind PSA is to add to the attention logits a term that encodes the relative position r_{ij} between query patch i and key patch j:
  A^h_{ij} = \mathrm{softmax}\big(Q^h_i K^{h\top}_j + v_{\mathrm{pos}}^{h\top}\, r_{ij}\big)
• By including the relative position term, the attention mechanism can learn to attend more to spatially close patches, which is particularly useful for pixel-wise prediction tasks such as keypoint estimation and segmentation.
• The ConViT architecture extends PSA with a "gated" version (GPSA), in which the relative-position term and the standard content term are combined through a learned gate:
  A^h_{ij} = (1 - \sigma(\lambda_h))\,\mathrm{softmax}\big(Q^h_i K^{h\top}_j\big) + \sigma(\lambda_h)\,\mathrm{softmax}\big(v_{\mathrm{pos}}^{h\top}\, r_{ij}\big)
• Here σ(λ_h) is a learned sigmoid gate that, for each head h, controls the balance between standard content-based self-attention and positional self-attention.
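To make the gating concrete, here is a minimal sketch of a single GPSA head, assuming the paper's three-dimensional relative encoding r_ij = (δ_x, δ_y, ‖δ‖²). The function names and shapes are illustrative, not the official ConViT implementation.

```python
import torch
import torch.nn.functional as F

def relative_encodings(n_side):
    """r[i, j] = (dx, dy, dx^2 + dy^2): offset from patch i to patch j
    on an n_side x n_side grid of patches."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(n_side), torch.arange(n_side), indexing="ij"), -1).view(-1, 2).float()
    delta = coords[None, :, :] - coords[:, None, :]                    # (N, N, 2)
    return torch.cat([delta, (delta ** 2).sum(-1, keepdim=True)], -1)  # (N, N, 3)

def gpsa_head(x, Wq, Wk, Wv, v_pos, r, lam):
    """One gated positional self-attention head. x: (N, d) patch embeddings."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                           # (N, d_h) each
    content = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # content attention (scaled dot product)
    position = F.softmax(r @ v_pos, dim=-1)                    # positional attention
    gate = torch.sigmoid(lam)                                  # sigma(lambda_h)
    attn = (1 - gate) * content + gate * position              # gated combination
    return attn @ v

# Tiny usage example: a 3x3 grid of 16-dim patches, head dimension 8.
N, d, d_h = 9, 16, 8
x = torch.randn(N, d)
out = gpsa_head(x, torch.randn(d, d_h), torch.randn(d, d_h), torch.randn(d, d_h),
                torch.randn(3), relative_encodings(3), torch.tensor(0.0))
print(out.shape)  # torch.Size([9, 8])
```

With σ(λ_h) = 0 the head reduces to ordinary content-based self-attention; with σ(λ_h) = 1 it ignores content entirely and attends purely by relative position.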
Background: Adaptive Attention Span and Positional Gating
• Adaptive attention span refers to a model's capability to dynamically adjust the range over which it attends to input tokens, based on the context of the data. This allows the model to focus on relevant information across varying distances, capturing both local and global dependencies efficiently.
• Positional gating regulates how much attention each head pays to position versus content information. By adjusting a gating parameter, each attention head controls the balance between positional details and content features, so some heads can prioritize spatial relationships while others concentrate on the content of the input.
• In ConViT, these mechanisms are integrated to leverage the benefits of both convolutional layers and self-attention. The model combines the locality-preserving characteristics of convolutions with the flexibility of self-attention, offering improved sample efficiency and performance on tasks like ImageNet classification. By incorporating adaptive attention span and positional gating, ConViT shows how these techniques can make Vision Transformers more effective at capturing complex patterns in visual data.

Investigating the Role of Locality
The ConViT paper investigates the role of locality in the following ways:

Quantifying Locality in Vanilla Self-Attention Layers:
  o The paper first examines how locality is naturally encouraged in the standard self-attention (SA) layers of Transformers.
  o During the early stages of training, the non-locality metric (a measure of how much the model attends to distant patches; sketched in code below) decreases, indicating that the model becomes more "convolutional" and focuses on local information.
  o In the later stages of training, the non-locality metric starts to increase again, as the upper layers capture long-range dependencies.

Analyzing Locality Escape in GPSA Layers:
  o The GPSA layers of ConViT are initialized to mimic the locality of convolutional layers.
  o The paper then examines how the GPSA layers escape this initial locality constraint during training: their non-locality metric increases throughout training, indicating that the model learns to capture both local and global dependencies.
  o The paper also analyzes the dynamics of the gating parameters σ(λ_h), which control the balance between attending to position information and content information.
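As a concrete reference for the metric used above, here is a minimal sketch of a non-locality measure: the attention-weighted distance between each query patch and the patches it attends to, averaged over queries. The function name and the exact averaging are illustrative; the paper additionally averages over heads and images.

```python
import torch

def non_locality(attn, coords):
    """attn: (N, N) attention map (rows sum to 1); coords: (N, 2) patch positions.
    Returns the mean attention-weighted distance from each query to its keys."""
    dist = torch.cdist(coords, coords)   # (N, N) pairwise patch distances
    return (attn * dist).sum(-1).mean()  # expected attended distance, averaged over queries

# A purely local map (each patch attends only to itself) scores 0;
# a uniform map over a 3x3 grid scores about 1.45.
coords = torch.stack(torch.meshgrid(
    torch.arange(3.0), torch.arange(3.0), indexing="ij"), -1).view(-1, 2)
print(non_locality(torch.eye(9), coords))
print(non_locality(torch.full((9, 9), 1 / 9), coords))
```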
Investigating the Impact of Locality Strength:
  o The paper performs ablation studies on the effects of the locality strength (α) and the number of GPSA layers on the performance of the ConViT model.
  o It finds that increasing both the locality strength and the number of GPSA layers leads to improved performance on ImageNet, particularly in the early stages of training.

In summary, the ConViT paper thoroughly investigates the role of locality in Transformer-based models, demonstrating how GPSA layers can effectively combine the strengths of convolutional and self-attention architectures by leveraging soft convolutional inductive biases.
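As a final illustration of the soft convolutional inductive bias, here is a sketch of how the locality strength α can shape the positional attention at initialization: each head's v_pos is chosen so its positional softmax peaks at one fixed offset Δ_h, like one tap of a 3×3 convolution kernel, with α controlling the sharpness of the peak. This is a reconstruction under the r_ij = (δ_x, δ_y, ‖δ‖²) encoding used in the GPSA sketch above, not the official initialization code.

```python
import torch
import torch.nn.functional as F

def relative_encodings(n_side):  # same encoding as in the GPSA sketch above
    coords = torch.stack(torch.meshgrid(
        torch.arange(n_side), torch.arange(n_side), indexing="ij"), -1).view(-1, 2).float()
    delta = coords[None, :, :] - coords[:, None, :]
    return torch.cat([delta, (delta ** 2).sum(-1, keepdim=True)], -1)

def init_v_pos(delta, alpha):
    """Chosen so that v_pos . r_ij = -alpha * (||d_ij - delta||^2 - ||delta||^2),
    i.e. the positional softmax peaks at offset `delta`, sharper for larger alpha."""
    return torch.tensor([2 * alpha * delta[0], 2 * alpha * delta[1], -alpha])

# One head centred on offset (-1, -1), i.e. the top-left tap of a 3x3 kernel.
r = relative_encodings(3)
for alpha in (0.5, 2.0, 10.0):
    attn = F.softmax(r @ init_v_pos((-1.0, -1.0), alpha), dim=-1)
    print(alpha, round(attn[4].max().item(), 3))  # centre patch: peak mass -> 1 as alpha grows
```

This connects directly to the ablation above: a larger α imposes a stronger convolutional prior at initialization, which the learned gates σ(λ_h) can later relax as training progresses.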