Always Skip Attention
- Conference: ICCV 2025
- arXiv: 2505.01996
- Code: Not yet publicly available
- Area: Self-Supervised Learning / Vision Transformer / Theoretical Analysis
- Keywords: self-attention, skip connection, condition number, ill-conditioning, token graying, Vision Transformer
TL;DR
This paper theoretically demonstrates that the self-attention mechanism in Vision Transformers is inherently ill-conditioned, leading to training collapse in the absence of skip connections. It further proposes Token Graying (TG), a method that improves the condition number of input tokens to enhance ViT training stability and performance.
Background & Motivation
Background: Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. Their core architecture consists of Self-Attention Blocks (SABs), feed-forward networks (FFNs), and skip connections. While skip connections are universally adopted as a standard component, a rigorous theoretical explanation for their role in ViTs has remained elusive.
Limitations of Prior Work:

- The authors identify an intriguing, previously unreported empirical phenomenon: removing the skip connection from the SAB causes catastrophic performance collapse in ViTs (a 22% drop on CIFAR-10), whereas removing the FFN skip connection results in only mild degradation (a 2% drop).
- This asymmetric dependency does not exist in CNNs (e.g., ConvMixer): removing skip connections has virtually no impact on CNN performance (±0.2%).
- The only prior work to train Transformers without skip connections required 5× more training time, and the underlying reason was left unexplained.
Key Challenge: Why is the self-attention mechanism so critically dependent on skip connections, while other components (FFN, convolution) are not? What is the fundamental mechanism at play?
Goal:

- Provide a theoretical explanation for the SAB's extreme reliance on skip connections.
- Characterize the true role of skip connections within SABs.
- Propose a novel method to improve ViT training based on these theoretical insights.
Key Insight: The analysis is conducted through the lens of the matrix condition number, which measures the degree of ill-conditioning. A larger condition number implies a more ill-conditioned Jacobian matrix and less stable gradient-based training.
Core Idea: The triple matrix multiplication structure inherent to self-attention causes the condition number of the output embeddings to grow as the cube of the input condition number, leading to intrinsic ill-conditioning. Skip connections serve precisely to regularize this condition number.
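This cubic growth is easy to observe numerically. Below is a rough sketch (not the paper's code) using random matrices with hypothetical ViT-like shapes and linear attention without softmax, the setting of the paper's theory:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 197, 64                       # tokens x embedding dim (hypothetical sizes)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Linear attention (no softmax): X Wq Wk^T X^T X Wv -- X enters three times
out = X @ Wq @ Wk.T @ X.T @ X @ Wv

kappa_in = np.linalg.cond(X)         # condition number: sigma_max / sigma_min
kappa_out = np.linalg.cond(out)
print(np.log(kappa_in), np.log(kappa_out))   # log-kappa, the scale the paper reports
```

The output's log condition number is far above the input's, mirroring the paper's observed jump from roughly \(e^3\) to \(e^6\) at SAB outputs.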
Method
Overall Architecture
The paper's contributions are divided into two parts:

1. Theoretical Analysis: Proving that SAB output embeddings are intrinsically ill-conditioned, and characterizing how skip connections ameliorate this.
2. Token Graying (TG): A preprocessing method that improves the condition number of input tokens as a complementary enhancement to skip connections.
Key Designs
- Ill-Conditioning Analysis of Self-Attention (Proposition 4.1):
    - Function: Theoretically derives an upper bound on the condition number of SAB output embeddings (without skip connections).
    - Mechanism: For linear attention (without softmax), the condition number of the SAB output \(\mathbf{XW_QW_K^TX^TXW_V}\) satisfies \(\kappa(\text{output}) \leq C \cdot (\sigma_{max}/\sigma_{min})^3\), i.e., it grows as the cube of the input matrix's condition number.
    - Design Motivation: This explains why the condition number of SAB output embeddings reaches \(e^6\) experimentally, compared to \(e^3\) for FFN outputs. In contrast, the FFN's multiplicative structure \(\mathbf{W_{down}W_{up}X}\) yields a condition-number upper bound that scales only linearly with the input condition number.
    - Although the theoretical proof is based on linear attention, experiments confirm that softmax attention exhibits the same ill-conditioning behavior.
- Regularizing Role of Skip Connections (Proposition 4.2):
    - Function: Theoretically proves that skip connections substantially improve the conditioning of SAB outputs.
    - Mechanism: \(\kappa(\mathbf{XM + X}) \ll \kappa(\mathbf{XM})\); adding the identity mapping dramatically reduces the condition number.
    - This provides a rigorous mathematical explanation for why skip connections are indispensable to SABs: they function not merely as gradient highways, but as condition-number regularizers.
- Token Graying (TG), SVD Variant:
    - Function: Reconstructs input tokens via SVD, amplifying small singular values to improve the condition number.
    - Mechanism: Applies SVD to the token matrix \(\mathbf{X = U\Sigma V^T}\), normalizes the singular values and amplifies them with an exponent \(\epsilon \in (0,1]\) (i.e., \(\tilde{\Sigma} = (\Sigma/\max(\Sigma))^\epsilon\)), then reconstructs \(\tilde{\mathbf{X}} = \mathbf{U\tilde{\Sigma}V^T}\).
    - Limitation: SVD computation is prohibitively expensive (approximately 6× slower training), making this variant impractical.
- Token Graying (TG), DCT Variant:
    - Function: Approximates the effect of SVD using the Discrete Cosine Transform (DCT).
    - Mechanism: In natural images, the dominant singular vectors from SVD typically correspond to low-frequency content, and the DCT is inherently a frequency-domain transform. The formulation is \(\hat{\mathbf{X}} = \mathbf{D}\mathbf{X}\mathbf{D}^T\), with amplification applied in the frequency domain followed by IDCT reconstruction.
    - Complexity: \(O(nd\log(nd))\) versus \(O(nd\min(n,d))\) for SVD, introducing negligible training overhead (0.732 days vs. 0.723 days baseline, compared with 4.552 days for SVD).
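The regularizing effect in Proposition 4.2 can be reproduced in a few lines. This is an illustrative sketch with a deliberately ill-conditioned mixing matrix M, not the paper's exact attention setting:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 197, 64
X = rng.standard_normal((n, d))

# Mixing matrix M with a spectrum spanning 8 decades -> severely ill-conditioned
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
M = 0.1 * U @ np.diag(np.logspace(0, -8, d)) @ V.T

no_skip = np.linalg.cond(X @ M)        # XM inherits M's ill-conditioning
with_skip = np.linalg.cond(X @ M + X)  # XM + X = X(M + I): identity dominates
print(no_skip, with_skip)
```

Because \(XM + X = X(M + I)\) and \(M + I\) is well-conditioned whenever \(\|M\|\) is not too large, adding the skip collapses the condition number by many orders of magnitude.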
Loss & Training
The method introduces no new loss functions. Standard cross-entropy (supervised) or MSE (MAE self-supervised pretraining) is used throughout. TG is applied as a preprocessing step prior to patch embedding, with a single hyperparameter \(\epsilon\) controlling the amplification magnitude (default: 0.95).
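Both TG variants described above can be sketched compactly. This is a minimal reconstruction from the paper's description, not its released code; in particular, the elementwise magnitude-amplification rule used in the DCT branch is one plausible reading of "amplification in the frequency domain":

```python
import numpy as np

def svd_token_graying(X, eps=0.95):
    # SVD variant: normalise singular values to (0, 1], raise to eps < 1.
    # Small singular values are lifted, so kappa(X_tilde) = kappa(X)**eps.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag((s / s.max()) ** eps) @ Vt

def dct_matrix(N):
    # Orthonormal DCT-II basis matrix D (rows are cosines, D @ D.T = I).
    k, i = np.arange(N)[:, None], np.arange(N)[None, :]
    D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * i + 1) * k / (2 * N))
    D[0] /= np.sqrt(2.0)
    return D

def dct_token_graying(X, eps=0.95):
    # DCT variant: X_hat = D X D^T, amplify small-magnitude (mostly
    # high-frequency) coefficients with the same exponent rule, then invert.
    Dn, Dd = dct_matrix(X.shape[0]), dct_matrix(X.shape[1])
    C = Dn @ X @ Dd.T
    mag, top = np.abs(C), np.abs(C).max()
    return Dn.T @ (np.sign(C) * top * (mag / top) ** eps) @ Dd

rng = np.random.default_rng(0)
X = rng.standard_normal((197, 64)) @ np.diag(np.logspace(0, -4, 64))
print(np.linalg.cond(X), np.linalg.cond(svd_token_graying(X, eps=0.6)))
```

With the SVD rule the relation is exact, \(\kappa(\tilde{\mathbf{X}}) = \kappa(\mathbf{X})^\epsilon\), consistent with the ablation trend that smaller \(\epsilon\) yields smaller \(\kappa_{in}\); \(\epsilon = 1\) leaves the tokens unchanged (exactly for the DCT sketch, up to a global scale for SVD).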
Key Experimental Results
Main Results
ImageNet-1K classification results across multiple ViT variants:
| Model | Top-1 Acc (%) | Top-5 Acc (%) |
|---|---|---|
| ViT-S | 80.2 | 95.1 |
| ViT-S + DCTTG | 80.4 | 95.2 |
| ViT-B | 81.0 | 95.3 |
| ViT-B + DCTTG | 81.3 | 95.4 |
| Swin-S | 81.3 | 95.6 |
| Swin-S + DCTTG | 81.6 | 95.6 |
| CaiT-S | 82.6 | 96.1 |
| CaiT-S + DCTTG | 82.7 | 96.3 |
| PVT V2 b3 | 82.9 | 96.0 |
| PVT V2 b3 + DCTTG | 83.0 | 96.1 |
Self-supervised learning (MAE pretraining + fine-tuning):
| Method | Top-1 Acc (%) | Top-5 Acc (%) |
|---|---|---|
| MAE | 83.0 | 96.4 |
| MAE + DCTTG | 83.2 | 96.6 |
Ablation Study
Effect of different \(\epsilon\) values on ViT-B (SVD variant):
| \(\epsilon\) | Top-1 Acc (%) | \(\kappa_{in}\) (log) | \(\kappa_{out}\) (log) |
|---|---|---|---|
| — (baseline) | 81.0 | 6.72 | 6.74 |
| 0.9 | 81.2 | 6.64 | 6.66 |
| 0.7 | 81.4 | 6.15 | 6.17 |
| 0.6 | 81.4 | 5.73 | 5.71 |
| 0.5 | 81.0 | 5.29 | 5.25 |
Skip connection ablation (ViT-Tiny, CIFAR-10):
| Configuration | CIFAR-10 Acc | Notes |
|---|---|---|
| Standard (SAB+FFN skip) | ~92% | Baseline |
| w/o FFN skip | ~90% | Mild degradation (−2%) |
| w/o SAB skip | ~70% | Catastrophic collapse (−22%) |
Training time comparison:
| Method | ViT-B Training Time (days) |
|---|---|
| Baseline | 0.723 |
| + SVDTG | 4.552 |
| + DCTTG | 0.732 |
Key Findings
- SAB vs. FFN Asymmetry: Removing the SAB skip connection causes the condition number to spike from \(e^3\) to \(e^6\), with training diverging after 30 epochs; removing the FFN skip has minimal impact.
- CNNs Are Unaffected: ConvMixer exhibits performance changes of <±0.2% upon skip connection removal, confirming that this is a pathology specific to self-attention.
- Condition Number Improvement Correlates with Performance: Performance peaks when \(\epsilon\) is reduced to 0.6–0.7, yielding the best condition numbers; excessively small \(\epsilon\) (e.g., 0.5) achieves good conditioning but may discard useful information.
- DCT Is an Efficient Approximation of SVD: Performance is comparable, but with virtually zero additional training overhead.
Highlights & Insights
- Profound Theoretical Insight: This work is the first to reveal, through the lens of condition numbers, the fundamental reason for self-attention's dependence on skip connections. The cubic growth of the condition number arising from SAB's triple matrix multiplication structure constitutes an elegant and empirically verifiable theoretical explanation.
- Simple and Practical Improvement: DCT Token Graying requires no architectural modification, introduces no additional parameters, and incurs negligible training overhead (+1.2%). Its implementation is straightforward—a DCT-based frequency-domain amplification step applied before patch embedding.
- Implications for ViT Design: This finding suggests a new optimization direction for ViTs: rather than designing more complex attention mechanisms, one may instead focus on improving the condition number within the self-attention computation. This may inspire novel attention normalization or initialization strategies.
Limitations & Future Work
- Modest Performance Gains: Improvements on ImageNet-1K are limited to 0.2–0.4%. While consistent across architectures, the absolute gains are small.
- Potential Issues with Low-Precision Training: DCT involves extensive multiplications and summations, which may be sensitive to quantization errors under low-precision arithmetic (FP16/BF16); this is not validated in the paper.
- Theory Covers Only Linear Attention: Proposition 4.1 is formally proved only for linear attention; the extension to softmax attention is supported solely by empirical evidence.
- Future Directions: Adaptive \(\epsilon\) scheduling strategies (varying across layers or training stages) could be explored. Condition number regularization could also be incorporated directly as part of the training objective.
Related Work & Insights
- vs. He et al. 2023 (Deep Transformers without Shortcuts): That work achieved skip-free training by introducing inductive biases into self-attention, but at the cost of 5× training time. The present analysis suggests that its success may be fundamentally attributed to implicit condition number improvement.
- vs. ResNet Skip Connections: In CNNs, skip connections primarily alleviate vanishing gradients; in ViTs, they play the more critical role of condition number regularization. This explains why VGG (a skip-free CNN) remains trainable, whereas skip-free ViTs do not.
- vs. cosFormer, Sigmoid Attention, and Other Attention Variants: These works replace softmax with alternative activation functions. The condition number perspective introduced here offers a new theoretical criterion for evaluating such alternatives.
- This finding carries significant implications for large-scale ViT design: as model depth increases, the cumulative degradation of condition numbers may be a primary source of training instability in deep Transformers.
Rating
- Novelty: ⭐⭐⭐⭐⭐ First to explain SAB's skip connection dependency through condition numbers; theoretically incisive.
- Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple architectures and tasks, though absolute performance gains are modest.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear; experimental presentation is systematic.
- Value: ⭐⭐⭐⭐ Theoretical contribution outweighs empirical gains, but insights for ViT design are of high practical value.