Always Skip Attention

Conference: ICCV 2025
arXiv: 2505.01996
Code: Not yet publicly available
Area: Self-Supervised Learning / Vision Transformer / Theoretical Analysis
Keywords: self-attention, skip connection, condition number, ill-conditioning, token graying, Vision Transformer

TL;DR

This paper theoretically demonstrates that the output embeddings of the self-attention mechanism in Vision Transformers are inherently ill-conditioned, leading to training collapse in the absence of skip connections. It further proposes Token Graying (TG), a method that improves the condition number of input tokens to enhance ViT training stability and performance.

Background & Motivation

Background: Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks. Their core architecture consists of Self-Attention Blocks (SABs), feed-forward networks (FFNs), and skip connections. While skip connections are universally adopted as a standard component, a rigorous theoretical explanation for their role in ViTs has remained elusive.

Limitations of Prior Work:

  • The authors identify an intriguing, previously unreported empirical phenomenon: removing the skip connection from the SAB causes catastrophic performance collapse in ViTs (a 22% drop on CIFAR-10), whereas removing the FFN skip connection results in only mild degradation (a 2% drop).
  • This asymmetric dependency does not exist in CNNs (e.g., ConvMixer): removing skip connections has virtually no impact on CNN performance (within ±0.2%).
  • The only prior work attempting to train Transformers without skip connections required 5× more training time, and the underlying reason was left unexplained.

Key Challenge: Why is the self-attention mechanism so critically dependent on skip connections, while other components (FFN, convolution) are not? What is the fundamental mechanism at play?

Goal:

  • Provide a theoretical explanation for SAB's extreme reliance on skip connections.
  • Characterize the true role of skip connections within SABs.
  • Propose a novel method to improve ViT training based on these theoretical insights.

Key Insight: The analysis is conducted through the lens of the matrix condition number \(\kappa(\mathbf{A}) = \sigma_{max}(\mathbf{A})/\sigma_{min}(\mathbf{A})\), the ratio of the largest to the smallest singular value, which measures the degree of ill-conditioning. A larger condition number implies a more ill-conditioned Jacobian matrix and less stable gradient-based training.

Core Idea: The triple matrix multiplication structure inherent to self-attention causes the condition number of the output embeddings to grow as the cube of the input condition number, leading to intrinsic ill-conditioning. Skip connections serve precisely to regularize this condition number.
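To make this concrete, here is a minimal NumPy sketch (not the paper's code; the shapes and the perfectly conditioned, small-spectral-norm weights are my own choices, made to isolate the effect of \(\kappa(\mathbf{X})\)). It builds token matrices with prescribed condition numbers and compares the linear-attention SAB output with and without the skip connection:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 197, 64  # tokens x channels; hypothetical ViT-S-like shapes

def cond(A):
    """Condition number: ratio of largest to smallest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return s.max() / s.min()

def rand_orth(k):
    """Random orthogonal matrix (QR of a Gaussian)."""
    return np.linalg.qr(rng.standard_normal((k, k)))[0]

# Perfectly conditioned weights (kappa = 1, spectral norm 0.5) isolate
# the contribution of kappa(X) to the output's conditioning.
Wq, Wk, Wv = (0.5 * rand_orth(d) for _ in range(3))

for kappa_x in (1e1, 1e2, 1e3):
    # Build X with a prescribed condition number via its SVD.
    U = np.linalg.qr(rng.standard_normal((n, d)))[0]
    S = np.geomspace(1.0, 1.0 / kappa_x, d)
    X = U @ np.diag(S) @ rand_orth(d).T

    M = Wq @ Wk.T @ X.T @ X @ Wv  # linear attention (no softmax)
    print(f"kappa(X)={cond(X):.1e}  kappa(X)^3={kappa_x**3:.1e}  "
          f"no skip: {cond(X @ M):.1e}  with skip: {cond(X @ M + X):.1e}")
```

In this toy setting the no-skip output's condition number blows up toward the cubic bound, while adding \(\mathbf{X}\) back keeps it near \(\kappa(\mathbf{X})\), consistent with Propositions 4.1 and 4.2 discussed below.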

Method

Overall Architecture

The paper's contributions are divided into two parts:

  1. Theoretical Analysis: proving that SAB output embeddings are intrinsically ill-conditioned, and characterizing how skip connections ameliorate this.
  2. Token Graying (TG): a preprocessing method that improves the condition number of input tokens as a complementary enhancement to skip connections.

Key Designs

  1. Ill-Conditioning Analysis of Self-Attention (Proposition 4.1):

    • Function: Theoretically derives an upper bound on the condition number of SAB output embeddings (without skip connections).
    • Mechanism: For linear attention (i.e., without the softmax), the condition number of the SAB output \(\mathbf{XW_QW_K^TX^TXW_V}\) satisfies \(\kappa(\mathbf{XW_QW_K^TX^TXW_V}) \leq C\,\kappa(\mathbf{X})^3\), where \(\kappa(\mathbf{X}) = \sigma_{max}(\mathbf{X})/\sigma_{min}(\mathbf{X})\) and \(C\) depends only on the conditioning of the weight matrices; the bound scales as the cube of the input's condition number.
    • Design Motivation: This explains why the condition number of SAB output embeddings reaches roughly \(e^6\) in experiments, compared to \(e^3\) for FFN outputs. In contrast, the FFN's multiplicative structure \(\mathbf{XW_{up}W_{down}}\) yields a condition-number upper bound that scales only linearly with the input's condition number.
    • Although the theoretical proof is based on linear attention, experiments confirm that softmax attention exhibits the same ill-conditioning behavior.
  2. Regularizing Role of Skip Connections (Proposition 4.2):

    • Function: Theoretically proves that skip connections substantially improve the conditioning of SAB outputs.
    • Mechanism: With \(\mathbf{M} = \mathbf{W_QW_K^TX^TXW_V}\) denoting the attention transform, \(\kappa(\mathbf{XM + X}) \ll \kappa(\mathbf{XM})\): adding the identity mapping back dramatically reduces the condition number.
    • This provides a rigorous mathematical explanation for why skip connections are indispensable to SABs: they function not merely as gradient highways, but as condition number regularizers.
  3. Token Graying (TG) — SVD Variant:

    • Function: Reconstructs input tokens via SVD decomposition, amplifying small singular values to improve the condition number.
    • Mechanism: Applies SVD to the token matrix \(\mathbf{X = U\Sigma V^T}\), normalizes the singular values by the largest, and raises them to an exponent \(\epsilon \in (0,1]\) (i.e., \(\tilde{\Sigma} = (\Sigma/\sigma_{max})^\epsilon\)), which amplifies the small singular values relative to the largest; the tokens are then reconstructed as \(\tilde{\mathbf{X}} = \mathbf{U\tilde{\Sigma}V^T}\). A code sketch of both TG variants follows this list.
    • Limitation: SVD computation is prohibitively expensive (approximately 6× slower training), making it impractical.
  4. Token Graying (TG) — DCT Variant:

    • Function: Approximates the effect of SVD using the Discrete Cosine Transform (DCT).
    • Mechanism: In natural images, the dominant singular vectors of SVD typically correspond to low-frequency content, and the DCT is inherently a frequency-domain transform. The formulation is \(\hat{\mathbf{X}} = \mathbf{DXD^T}\), where \(\mathbf{D}\) is the orthogonal DCT matrix, with amplification applied in the frequency domain followed by IDCT reconstruction.
    • Computational complexity is \(O(nd\log(nd))\) versus \(O(nd\min(n,d))\) for SVD, introducing negligible training overhead (0.732 days vs. 0.723 days baseline vs. 4.552 days for SVD).
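The following is a compact sketch of both TG variants (function names are mine; applying the power law to DCT coefficient magnitudes is my reading of the description above, not a confirmed implementation detail):

```python
import numpy as np
from scipy.fft import dctn, idctn

def svd_token_graying(X, eps=0.95):
    """SVD-TG: compress the singular-value range with exponent eps in (0, 1].

    sigma_tilde = (sigma / sigma_max) ** eps amplifies small singular values
    relative to the largest, so kappa(X) improves from k to roughly k ** eps.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * (S / S.max()) ** eps) @ Vt

def dct_token_graying(X, eps=0.95):
    """DCT-TG: the same power law applied to DCT coefficient magnitudes.

    A cheap stand-in for SVD-TG: for natural-image tokens, low-frequency DCT
    bases roughly align with the dominant singular directions. The exact
    amplification rule here is an assumption, not a confirmed detail.
    """
    C = dctn(X, norm="ortho")                        # X_hat = D X D^T
    mag = np.abs(C)
    C_tilde = np.sign(C) * (mag / mag.max()) ** eps  # amplify small coefficients
    return idctn(C_tilde, norm="ortho")              # reconstruct via IDCT

if __name__ == "__main__":
    def cond(A):
        s = np.linalg.svd(A, compute_uv=False)
        return s.max() / s.min()

    rng = np.random.default_rng(0)
    # Synthetic ill-conditioned token matrix: 196 patch tokens, 768 channels.
    X = rng.standard_normal((196, 768)) @ np.diag(np.geomspace(1.0, 1e-3, 768))
    # eps = 0.6-0.7 was the sweet spot in the paper's SVD ablation.
    # On random tokens DCT-TG may help less than SVD-TG: its benefit
    # relies on natural-image statistics.
    for f in (svd_token_graying, dct_token_graying):
        print(f"{f.__name__}: kappa {cond(X):.1e} -> {cond(f(X, eps=0.7)):.1e}")
```

Note that SVD-TG maps \(\kappa(\mathbf{X})\) to \(\kappa(\mathbf{X})^\epsilon\) by construction, whereas the DCT variant only approximates this for inputs whose dominant singular directions are low-frequency, as with natural images.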

Loss & Training

The method introduces no new loss functions. Standard cross-entropy (supervised) or MSE (MAE self-supervised pretraining) is used throughout. TG is applied as a preprocessing step prior to patch embedding, with a single hyperparameter \(\epsilon\) controlling the amplification magnitude (default: 0.95).
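As a usage illustration, here is one plausible way to slot DCT-TG in as input preprocessing (an assumption on my part: the summary says TG runs before patch embedding, and I apply it to image pixels via a torchvision-style transform; the paper may instead operate on patch tokens):

```python
import numpy as np
import torch
from scipy.fft import dctn, idctn
from torchvision import transforms

class DCTTokenGraying:
    """Hypothetical torchvision-style transform: DCT-TG over each image's
    spatial dimensions, applied before the model's patch embedding."""

    def __init__(self, eps=0.95):
        self.eps = eps

    def __call__(self, img):  # img: float tensor of shape (C, H, W)
        x = img.numpy()
        C = dctn(x, axes=(-2, -1), norm="ortho")         # to frequency domain
        mag = np.abs(C)
        C = np.sign(C) * (mag / mag.max()) ** self.eps   # amplify small coeffs
        x = idctn(C, axes=(-2, -1), norm="ortho")        # back to pixel space
        return torch.from_numpy(x.astype(np.float32))

# eps = 0.95 is the default reported in the summary above.
transform = transforms.Compose([transforms.ToTensor(), DCTTokenGraying(eps=0.95)])
```

Because TG only transforms the input, no gradients need to flow through the DCT, so a NumPy implementation inside the data pipeline suffices.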

Key Experimental Results

Main Results

ImageNet-1K classification results across multiple ViT variants:

Model Top-1 Acc (%) Top-5 Acc (%)
ViT-S 80.2 95.1
ViT-S + DCTTG 80.4 95.2
ViT-B 81.0 95.3
ViT-B + DCTTG 81.3 95.4
Swin-S 81.3 95.6
Swin-S + DCTTG 81.6 95.6
CaiT-S 82.6 96.1
CaiT-S + DCTTG 82.7 96.3
PVT V2 b3 82.9 96.0
PVT V2 b3 + DCTTG 83.0 96.1

Self-supervised learning (MAE pretraining + fine-tuning):

Method Top-1 Acc (%) Top-5 Acc (%)
MAE 83.0 96.4
MAE + DCTTG 83.2 96.6

Ablation Study

Effect of different \(\epsilon\) values on ViT-B (SVD variant):

\(\epsilon\) Top-1 Acc (%) \(\log\kappa_{in}\) \(\log\kappa_{out}\)
— (baseline) 81.0 6.72 6.74
0.9 81.2 6.64 6.66
0.7 81.4 6.15 6.17
0.6 81.4 5.73 5.71
0.5 81.0 5.29 5.25

Skip connection ablation (ViT-Tiny, CIFAR-10):

Configuration CIFAR-10 Acc Notes
Standard (SAB+FFN skip) ~92% Baseline
w/o FFN skip ~90% Mild degradation (−2%)
w/o SAB skip ~70% Catastrophic collapse (−22%)

Training time comparison:

Method ViT-B Training Time (days)
Baseline 0.723
+ SVDTG 4.552
+ DCTTG 0.732

Key Findings

  • SAB vs. FFN Asymmetry: Removing the SAB skip connection causes the condition number to spike from \(e^3\) to \(e^6\), with training diverging after 30 epochs; removing the FFN skip has minimal impact.
  • CNNs Are Unaffected: ConvMixer's performance changes by less than ±0.2% when skip connections are removed, confirming that this is a pathology specific to self-attention.
  • Condition Number Improvement Correlates with Performance: Performance peaks when \(\epsilon\) is reduced to 0.6–0.7, yielding the best condition numbers; excessively small \(\epsilon\) (e.g., 0.5) achieves good conditioning but may discard useful information.
  • DCT Is an Efficient Approximation of SVD: Performance is comparable, but with virtually zero additional training overhead.

Highlights & Insights

  • Profound Theoretical Insight: This work is the first to reveal, through the lens of condition numbers, the fundamental reason for self-attention's dependence on skip connections. The cubic growth of the condition number arising from SAB's triple matrix multiplication structure constitutes an elegant and empirically verifiable theoretical explanation.
  • Simple and Practical Improvement: DCT Token Graying requires no architectural modification, introduces no additional parameters, and incurs negligible training overhead (+1.2%). Its implementation is straightforward—a DCT-based frequency-domain amplification step applied before patch embedding.
  • Implications for ViT Design: This finding suggests a new optimization direction for ViTs: rather than designing more complex attention mechanisms, one may instead focus on improving the condition number within the self-attention computation. This may inspire novel attention normalization or initialization strategies.

Limitations & Future Work

  • Modest Performance Gains: Improvements on ImageNet-1K are limited to roughly 0.1–0.3%. While consistent across architectures, the absolute gains are small.
  • Potential Issues with Low-Precision Training: DCT involves extensive multiplications and summations, which may be sensitive to quantization errors under low-precision arithmetic (FP16/BF16); this is not validated in the paper.
  • Theory Covers Only Linear Attention: Proposition 4.1 is formally proved only for linear attention; the extension to softmax attention is supported solely by empirical evidence.
  • Future Directions: Adaptive \(\epsilon\) scheduling strategies (varying across layers or training stages) could be explored. Condition number regularization could also be incorporated directly as part of the training objective.
  • vs. He et al. 2023 (Deep Transformers without Shortcuts): That work achieved skip-free training by introducing inductive biases into self-attention, but at the cost of 5× training time. The present analysis suggests that its success may be fundamentally attributed to implicit condition number improvement.
  • vs. ResNet Skip Connections: In CNNs, skip connections primarily alleviate vanishing gradients; in ViTs, they play the more critical role of condition number regularization. This explains why VGG (a skip-free CNN) remains trainable, whereas skip-free ViTs do not.
  • vs. cosFormer, Sigmoid Attention, and Other Attention Variants: These works replace softmax with alternative activation functions. The condition number perspective introduced here offers a new theoretical criterion for evaluating such alternatives.
  • This finding carries significant implications for large-scale ViT design: as model depth increases, the cumulative degradation of condition numbers may be a primary source of training instability in deep Transformers.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ First to explain SAB's skip connection dependency through condition numbers; theoretically incisive.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Validated across multiple architectures and tasks, though absolute performance gains are modest.
  • Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are clear; experimental presentation is systematic.
  • Value: ⭐⭐⭐⭐ Theoretical contribution outweighs empirical gains, but insights for ViT design are of high practical value.