How Do Vision Transformers Work?
Introduction
Vision Transformers (ViTs) leverage multi-headed self-attention (MSA) as their core mechanism, but the reasons for their superior performance over traditional convolutions are not always clear. This summary explores the optimization benefits of MSA, highlighting how its unique inductive biases shape the learning process. Notably, ViTs often struggle on small datasets, not due to overfitting, but because of optimization challenges.
MSAs can be seen as large, data-specific kernels. However, a weak inductive bias does not automatically improve predictive performance; introducing appropriate constraints can help ViTs learn stronger representations. For example, local MSAs outperform global MSAs on both small and large datasets. Global attention can mix signals in arbitrary ways, which makes the loss surface more non-convex (an argument that may not carry over to text tasks). (Question: in tasks such as scene understanding or object detection, might global attention be beneficial?)
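To illustrate what "local" means here, the sketch below builds a boolean mask that restricts each token to a small spatial neighbourhood; the function name and the `window` parameter are placeholders for illustration, not the paper's implementation.

```python
import torch

def local_attention_mask(height, width, window):
    """Allow each token to attend only to tokens within `window` rows/columns."""
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    # Token i may attend to token j only if they are close along both axes.
    mask = ((ys[:, None] - ys[None, :]).abs() <= window) & \
           ((xs[:, None] - xs[None, :]).abs() <= window)
    return mask  # apply by setting attention logits to -inf where mask is False
```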
Key properties of MSAs from previous research:
- MSAs enhance the predictive performance of CNNs.
- ViTs are robust to data corruptions, occlusions, and adversarial attacks.
- MSAs near the final layers significantly improve predictions by ensembling and stabilizing processed features.
MSAs act as trainable, spatially varying smoothing operators, adaptively smoothing features based on input. This smoothing effect is central to their effectiveness. Smoothing reduces variance, making models less sensitive to parameter changes and lowering the magnitude of Hessian eigenvalues (which measure the curvature of the loss landscape).
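To make the smoothing interpretation concrete, here is a minimal single-head sketch in PyTorch (the projection matrices `w_q`, `w_k`, `w_v` are placeholders): because each softmax row sums to one, every output token is a convex combination, i.e., a data-dependent spatial average, of the value tokens.

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, w_q, w_k, w_v):
    """x: (num_tokens, dim) flattened feature map; w_*: (dim, dim) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row of `attn` is non-negative and sums to 1, so the output is a
    # spatially varying, data-dependent weighted average of the value tokens.
    attn = F.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```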
How MSAs Improve Optimization
- Stronger inductive bias yields stronger representations: experiments show that a stronger (more appropriate) inductive bias leads to lower test error and lower negative log-likelihood (NLL). Since NLL rises as the bias weakens, ViTs with weak bias underfit the training data rather than overfit it.
- ViTs do not overfit small datasets: as dataset size decreases, both test error and NLL increase, indicating that the main problem is poor fitting, not overfitting.
- Non-convex loss landscapes can hurt ViT performance: ViTs have more non-convex loss surfaces than ResNets, with larger Hessian eigenvalues. However, large datasets help suppress these eigenvalues, making the loss more convex. Techniques like global average pooling (GAP) and sharpness-aware minimization (SAM) can further smooth the loss and improve performance.
- MSAs flatten the loss landscape: MSAs reduce the magnitude of the Hessian eigenvalues. Since the Hessian captures the local curvature of the loss function, lowering its eigenvalue magnitudes means MSAs smooth the loss landscape (a simple way to probe this is sketched after this list).
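The sketch below estimates the dominant Hessian eigenvalue of a training loss by power iteration on Hessian-vector products. It assumes a PyTorch `loss` already computed from a model and a batch, and is an illustrative probe rather than the paper's exact measurement protocol.

```python
import torch

def max_hessian_eigenvalue(loss, params, n_iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`."""
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iters):
        # Normalize the direction vector.
        norm = torch.sqrt(sum((u * u).sum() for u in v))
        v = [u / norm for u in v]
        # Hessian-vector product via the Pearlmutter trick: differentiate (grad . v).
        hv = torch.autograd.grad(
            sum((g * u).sum() for g, u in zip(grads, v)), params, retain_graph=True)
        # Rayleigh quotient gives the current eigenvalue estimate.
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()
        v = [h.detach() for h in hv]
    return eig
```

Comparing these estimates for a ViT and a ResNet, or for a model before and after adding MSA blocks, is one way to observe the flattening effect described above.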
MSAs vs. Convolutions
- Convolutions: Data-agnostic and channel-specific.
- MSAs: Data-specific and channel-agnostic.
When analyzing ViT feature maps in the frequency domain, MSAs consistently reduce high-frequency amplitudes, acting as low-pass filters (smoothing out rapid variations). In contrast, MLPs in convolutional networks tend to increase high-frequency components. Low-frequency signals correspond to image shapes, while high-frequency signals relate to texture.
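As a rough proxy for this frequency analysis (the paper reports relative log amplitudes of Fourier-transformed feature maps), the sketch below measures the fraction of spectral amplitude outside a central low-frequency band; a layer that lowers this ratio between its input and output is acting as a low-pass filter.

```python
import torch

def high_freq_ratio(feature_map):
    """feature_map: (C, H, W). Fraction of spectral amplitude outside the low band."""
    spec = torch.fft.fftshift(torch.fft.fft2(feature_map), dim=(-2, -1)).abs()
    C, H, W = spec.shape
    h, w = H // 4, W // 4  # central band holds the low-frequency components
    low = spec[:, H // 2 - h:H // 2 + h, W // 2 - w:W // 2 + w].sum()
    return ((spec.sum() - low) / spec.sum()).item()
```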
MSAs aggregate and average feature maps, reducing their variance and effectively ensembling features. In contrast, convolutions do not perform this aggregation. Notably, feature map variance tends to accumulate and peak at the end of each network stage.
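One simple way to check this variance pattern on a model of your own is to record each module's output variance with forward hooks, as in the sketch below (any `torch.nn.Module` works; this is an illustrative measurement, not the paper's code).

```python
import torch

def register_variance_hooks(model, records):
    """Store each module's output feature-map variance in the `records` dict."""
    handles = []
    for name, module in model.named_modules():
        def hook(_module, _inputs, output, name=name):
            if torch.is_tensor(output):
                records[name] = output.detach().float().var().item()
        handles.append(module.register_forward_hook(hook))
    return handles  # call .remove() on each handle when done
```

After a single forward pass, plotting the recorded values stage by stage should show variance building up toward the end of each stage if the observation above holds.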
All these observations suggest that MSAs can complement Convs.
Combining MSAs and Convolutions
Hybrid Architecture

Lesion studies show that placing MSAs near the end of each stage significantly boosts performance. The proposed hybrid architecture, called AlterNet, alternates between Conv blocks and MSA blocks (a minimal sketch of this alternation follows the list below):
- Replace Conv blocks with MSA blocks, starting from the end of a baseline CNN.
- If an added MSA block does not improve performance, try replacing a Conv block earlier in the network.
- Use more attention heads and higher hidden dimensions for MSA blocks in later stages.
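Here is a minimal sketch of that build-up rule; `conv_block` and `msa_block` are placeholder constructors (e.g., a ResNet block and a local MSA block), not the exact blocks from the paper.

```python
import torch.nn as nn

def build_stage(conv_block, msa_block, depth, n_msa):
    """One AlterNet-style stage: Conv blocks first, MSA blocks at the end."""
    assert 0 <= n_msa <= depth
    blocks = [conv_block() for _ in range(depth - n_msa)]
    blocks += [msa_block() for _ in range(n_msa)]  # attention goes at the end of the stage
    return nn.Sequential(*blocks)
```

Following the rules above, `n_msa` is increased starting from the last stage of the baseline CNN, with more heads and larger hidden dimensions in later stages.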

Key findings:
- MSAs at the end of stages improve accuracy.
- Too many MSAs in early stages can harm accuracy, but MSAs at the end of intermediate stages (e.g., c2) are beneficial.
- MSAs in AlterNet suppress large Hessian eigenvalues, flattening the loss landscape more than ResNet during early training.
Discussion
MSAs are not just generalized convolutions—they are generalized spatial smoothing operators. By flattening the loss landscape and ensembling feature points, MSAs help ViTs learn strong, robust representations.
Limitations and Future Work
- Extend the analysis of MSAs to other vision tasks such as object detection, segmentation, video, or multimodal learning.
- Explore other architectures combining MSAs and Convs.
- The paper states that a strong inductive bias is beneficial; is this true for all types of image tasks?