Enhancing Transformers Through Conditioned Embedded Tokens

Conference: ICCV 2025 | arXiv: 2505.12789 | Code: Unavailable | Area: Image Segmentation / General Transformer Improvement | Keywords: Transformer, condition number, self-attention, embedded tokens, optimization stability

TL;DR

This paper identifies an inherent ill-conditioning problem in the self-attention matrices of Transformers. Through theoretical analysis, it establishes a direct relationship between the condition number of the self-attention matrix and that of the embedded token matrix, and proposes Conditioned Embedded Tokens, an SVD-based correction term applied to the embedding matrix. The method yields consistent improvements across image classification, object detection, instance segmentation, and NLP tasks.

Background & Motivation

Problem Definition

The core of Transformers is the self-attention mechanism, which models global dependencies via \(\mathbf{A}(X) = \text{softmax}(XW_QW_K^TX^T)XW_V\). The condition number of a matrix (the ratio of its largest to smallest singular value) is a critical indicator of optimization difficulty: a larger condition number leads to more unstable gradient optimization.
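
For concreteness, here is a minimal NumPy sketch of the quantities involved, computing \(\kappa\) for a random token matrix and for the resulting attention output. Random data will not reproduce the large \(\kappa(X)\) observed for learned embeddings; all names and sizes are illustrative.

```python
import numpy as np

def kappa(M: np.ndarray) -> float:
    """Condition number: ratio of largest to smallest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

def softmax(Z: np.ndarray) -> np.ndarray:
    Z = Z - Z.max(axis=-1, keepdims=True)   # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, d = 197, 64                              # e.g. ViT-B/16 token count, head dim
X = rng.standard_normal((N, d))             # embedded token matrix
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

A = softmax(X @ Wq @ Wk.T @ X.T) @ X @ Wv   # self-attention output A(X)
print(f"kappa(X) = {kappa(X):.1f}, kappa(A(X)) = {kappa(A):.1f}")
```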

Limitations of Prior Work

  1. Condition number studies in feed-forward networks are relatively mature (weight matrix preconditioning, NTK condition improvement), but the condition number problem in self-attention has received almost no attention.
  2. Existing optimization improvements for Transformers (e.g., skip connections, multi-head attention) improve conditioning only indirectly, lacking a systematic theoretical framework.
  3. In practice, the condition number of the embedded token matrix is often extremely large, causing gradient instability.

Core Idea

The upper bound on the condition number of the self-attention matrix is directly related to the condition number of the embedded token matrix \(X\) (scaling as \(\kappa(X)^3\) for linear attention). By computing an SVD of \(X\) and adding a correction term \(C\) such that \(\kappa(X+C) \leq 2\), the ill-conditioning of self-attention can be substantially reduced.
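
One explicit construction consistent with this claim (the paper's exact choice of \(C\) may differ in detail) is to clip the small singular values. Writing the SVD as \(X = U\Sigma V^\top\) with \(\sigma_1 \geq \dots \geq \sigma_d > 0\), set

\[
C = U(\tilde{\Sigma} - \Sigma)V^\top, \qquad \tilde{\sigma}_i = \max\!\left(\sigma_i, \tfrac{\sigma_1}{2}\right),
\]

so that \(X + C = U\tilde{\Sigma}V^\top\) and \(\kappa(X+C) = \sigma_1 / \min_i \tilde{\sigma}_i \leq 2\).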

Method

Overall Architecture

A correction matrix \(C \in \mathbb{R}^{N \times d}\) is added to the embedded token matrix \(X = [Ex_1 \cdots Ex_N]^T \in \mathbb{R}^{N \times d}\) at the first Transformer layer, greatly reducing the condition number of \(X+C\). The correction term \(C\) is computed from an SVD of \(X\), and the corrected matrix is fed into the first Transformer layer.
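
Since no code is released, the following PyTorch sketch shows one plausible drop-in implementation of this step, assuming the singular-value clipping construction above; the function name and the exact correction rule are illustrative assumptions, not the authors' implementation.

```python
import torch

def condition_tokens(X: torch.Tensor, target_kappa: float = 2.0) -> torch.Tensor:
    """Return X + C, where C is a deterministic SVD-based correction that
    brings kappa(X + C) down to at most target_kappa.
    X: embedded tokens of shape (N, d). No learnable parameters."""
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)  # S sorted descending
    S_fixed = torch.maximum(S, S[0] / target_kappa)      # raise small singular values
    return U @ torch.diag(S_fixed) @ Vh                  # equals X + C

# Usage: apply once to the embedded tokens, before the first Transformer layer.
X = torch.randn(197, 768)                 # e.g. ViT-B/16 tokens for one image
X_cond = condition_tokens(X)
s = torch.linalg.svdvals(X_cond)
print((s[0] / s[-1]).item())              # <= 2 up to numerical error
```

Because the correction is deterministic, it adds no parameters; during training it would be recomputed per forward pass, which is the overhead discussed under Limitations.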

Key Designs

1. Condition Number Analysis of Self-Attention

  • Function: Establishes theoretical upper bounds on the condition number of the self-attention matrix.
  • Core Result (Proposition 4.2):
    • Linear attention: \(\kappa(\mathbf{LA}(X)) \leq \kappa(W_Q) \cdot \kappa(W_K) \cdot \kappa(W_V) \cdot \kappa(X)^3\)
    • Softmax attention: \(\kappa(\mathbf{A}(X)) \leq \kappa(\text{softmax}(XW_QW_K^TX^T)) \cdot \kappa(X) \cdot \kappa(W_V)\)
  • Design Motivation: \(\kappa(X)\) is typically very large in practice and constitutes the primary bottleneck; reducing \(\kappa(X)\) simultaneously reduces the condition number of the entire self-attention operation (a numerical sanity check of the linear-attention bound appears below).
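
As referenced above, a quick NumPy sanity check of the linear-attention bound on random full-rank matrices; this is an empirical illustration under generic random data, not a proof.

```python
import numpy as np

def kappa(M: np.ndarray) -> float:
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(1)
N, d = 64, 32
X = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

LA = X @ Wq @ Wk.T @ X.T @ X @ Wv          # linear (softmax-free) attention
bound = kappa(Wq) * kappa(Wk) * kappa(Wv) * kappa(X) ** 3
print(f"kappa(LA) = {kappa(LA):.3e}  <=  bound = {bound:.3e}")
```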

2. Conditioned Embedded Tokens

  • Function: Constructs a correction matrix \(C\) such that \(\kappa(X+C) \leq 2\).
  • Core Theorem (Theorem 4.4): For any embedding matrix with \(\kappa(X) > 2\), there exists a \(C\) such that \(\kappa(X+C) \leq 2\).
  • Mechanism: An SVD of \(X\) is computed, and the optimal correction term is constructed by adjusting the singular values. The correction is deterministic and introduces no additional learnable parameters.
  • Design Motivation: Even though the upper bound may be loose, experiments demonstrate a strong correlation between reduced condition numbers and improved performance.

3. Cross-Layer Propagation Effect

  • Function: Validates that the conditioning improvement at the first layer propagates to subsequent layers.
  • Core Finding: Although the theoretical guarantee applies only to the first layer, experiments show that the average self-attention condition number is significantly reduced across all layers.
  • Design Motivation: The output of the first layer serves as input to the second, giving the conditioning improvement a cascading effect (a measurement sketch follows).
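
One way such a propagation effect could be measured is to hook every attention module, log the condition number of its output, and average over batches. A sketch assuming the model's attention blocks are `nn.MultiheadAttention` instances; real ViT codebases often use custom attention classes, so the type filter would need adapting.

```python
import torch
import torch.nn as nn

kappas: dict[str, list[float]] = {}

def make_hook(name: str):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output  # MHA returns (out, weights)
        out2d = out.flatten(0, -2) if out.dim() > 2 else out      # -> (tokens, d)
        s = torch.linalg.svdvals(out2d)
        kappas.setdefault(name, []).append((s[0] / s[-1]).item())
    return hook

def register_attention_hooks(model: nn.Module) -> None:
    for name, module in model.named_modules():
        if isinstance(module, nn.MultiheadAttention):
            module.register_forward_hook(make_hook(name))

# After register_attention_hooks(model) and a few forward passes:
# for name, ks in kappas.items():
#     print(name, sum(ks) / len(ks))   # average attention kappa per layer
```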

Loss & Training

No additional loss functions are introduced. The method serves as a drop-in replacement for the existing embedding layer, with all original training configurations unchanged.

Key Experimental Results

Main Results

ImageNet-1k Image Classification (Top-1 %)

Model         Baseline   +Conditioned   Gain
ViT-Base      80.3       81.3           +1.0
DeiT-Base     81.6       82.5           +0.9
Swin-Base     83.1       83.9           +0.8
XCiT-Medium   82.2       82.9           +0.7
DaViT-Base    83.6       84.6           +1.0

COCO Object Detection and Instance Segmentation (Mask R-CNN, AP)

Model             AP_box   AP50_box   AP_mask   AP50_mask
XCiT-S Baseline   44.9     66.1       40.1      63.1
XCiT-S +Cond.     45.7     66.6       40.4      63.5
XCiT-M Baseline   45.7     66.8       40.8      63.6
XCiT-M +Cond.     46.2     67.4       41.4      63.8

GLUE Benchmark (Crammed BERT, Accuracy)

Task       MNLI   SST-2   RTE    QNLI   QQP    MRPC   CoLA   GLUE Avg.
Baseline   83.8   92.3    55.1   90.1   87.3   85.0   48.9   78.6
+Cond.     84.2   92.5    55.6   91.1   87.4   86.3   53.7   79.7

Ablation Study

Condition Number Comparison (ViT-B, averaged over the full training run)

Metric                                  Baseline               +Conditioned
Embedded token \(\kappa(X)\)            \(\sim 10^3\)          \(\sim 10^1\)
Layer-1 attention \(\kappa\)            significantly higher   significantly reduced
All-layer attention \(\kappa\) (avg.)   higher                 significantly reduced

GPT-2 Validation Loss (TinyStories)

Model                Val. Loss ↓
GPT-2 Baseline       2.41
GPT-2 +Conditioned   2.36

Nyströmformer Long-Sequence Benchmark (LRA, Accuracy %)

Task       ListOps   Text   Retrieval   Image   Pathfinder
Baseline   37.1      63.8   79.8        39.9    72.9
+Cond.     37.9      64.9   80.9        40.1    73.3

Key Findings

  • Conditioning improvements are consistently effective across all tested architectures (ViT, DeiT, Swin, XCiT, DaViT, BERT, GPT-2, Nyströmformer).
  • The method benefits not only standard self-attention but also advanced attention mechanisms including shifted windows, cross-covariance, and Nyström approximations.
  • The conditioning improvement at the first layer cascades to all subsequent layers.
  • The method introduces zero additional parameters and no additional loss terms, functioning as a pure drop-in replacement.

Highlights & Insights

  1. Effective integration of theory and practice: The condition number analysis provides clear theoretical motivation; although a complete proof from condition number to optimization convergence is yet to be established, experimental results consistently support the conclusions.
  2. Strong generality: The same method is effective across CV, NLP, and long-sequence modeling, and can be directly embedded into various modern Transformer architectures.
  3. Simple implementation: Only an SVD of the embedding matrix and addition of the correction term are required, with no modifications to training configurations and no new hyperparameters.
  4. Reveals a previously overlooked optimization bottleneck: The condition number of embedded tokens frequently reaches the order of \(10^3\), a problem that has received almost no prior attention.

Limitations & Future Work

  1. Missing final theoretical step: The complete chain from condition number improvement → NTK improvement → accelerated optimization convergence has not been formally proven.
  2. The condition number upper bound for softmax attention requires additional assumptions (the conditional assumption in Eq. 14).
  3. SVD computational overhead: An SVD of the embedding matrix is required at every forward pass, which may introduce non-trivial overhead in large-scale models (a rough timing sketch appears after this list).
  4. Theoretical guarantees limited to the first layer: Although experiments show benefits across multiple layers, theoretical analysis for subsequent layers is absent.
  5. Effects on very deep Transformers (e.g., LLMs) have not been sufficiently validated.
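
To put the SVD overhead of limitation 3 in perspective, here is a rough, machine-dependent micro-benchmark sketch; the sizes are illustrative of ViT-B-scale token matrices versus longer sequences, not measurements from the paper.

```python
import time
import torch

def svd_ms(N: int, d: int, reps: int) -> float:
    """Average wall-clock time (ms) of one thin SVD of an N x d matrix."""
    X = torch.randn(N, d)
    t0 = time.perf_counter()
    for _ in range(reps):
        torch.linalg.svd(X, full_matrices=False)
    return (time.perf_counter() - t0) / reps * 1e3

# Thin SVD cost grows roughly as O(N * d^2) for N >= d, so short ViT token
# matrices are cheap while long sequences with wide embeddings are not.
for N, d, reps in [(197, 768, 50), (1024, 768, 20), (4096, 1024, 5)]:
    print(f"SVD of {N}x{d}: {svd_ms(N, d, reps):.1f} ms")
```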

Related Work

  • Weight conditioning [Saratchandran et al., 2025] applies preconditioning to feed-forward network weight matrices; the present work extends this idea to the attention mechanism.
  • NTK condition number analysis [Liu et al., 2022] establishes the relationship between condition numbers and gradient-descent convergence, but is restricted to feed-forward networks.
  • Skip connections have been shown to improve the condition number of attention blocks [Ji et al., 2025]; the proposed method is complementary to this line of work.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First systematic analysis of condition numbers in self-attention with a theoretically motivated correction method.
  • Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Covers 5 CV models, 2 language models, 1 long-sequence model, and 4 tasks; highly comprehensive.
  • Writing Quality: ⭐⭐⭐⭐ — Theoretical derivations are clear and experiments are well organized.
  • Value: ⭐⭐⭐⭐ — High practical value as a drop-in improvement, though SVD overhead may limit applicability to large models.