# Quantum Doubly Stochastic Transformers
- **Conference**: NeurIPS 2025
- **arXiv**: 2504.16275
- **Code**: None
- **Area**: Quantum Computing / Transformer
- **Keywords**: Variational quantum circuits, doubly stochastic matrices, attention mechanism, ViT, Birkhoff polytope
## TL;DR
This paper proposes QDSFormer (Quantum Doubly Stochastic Transformer), which replaces softmax with QontOT, a variational quantum circuit that generates doubly stochastic attention matrices (DSMs). Both theoretical analysis and experiments show that quantum-circuit-generated DSMs are more diverse and preserve more information, consistently outperforming the standard ViT and the Sinkformer on multiple small-scale visual recognition tasks.
## Background & Motivation
- In Transformers, softmax produces row-stochastic attention matrices (each row sums to 1), leading to various training instabilities:
  - Entropy collapse: overly sharp attention causing vanishing gradients
  - Rank collapse and token uniformity
  - Eureka moments: sudden learning transitions on combinatorial problems
- Sinkformer found that attention naturally tends toward doubly stochastic matrices (row and column sums = 1), and enforcing double stochasticity improves performance
- Limitations of the Sinkhorn algorithm:
  - Iterative approximation that, in practice, struggles to converge to a true DSM (Frobenius distance still 0.23 at \(k=21\) iterations)
  - Non-parametric, so it cannot learn which DSM to return
  - Requires non-negative inputs (enforced via exponentiation), limiting expressivity
  - Backpropagation gradients can be ill-conditioned
- Key insight: the QontOT quantum circuit provably produces DSMs (\(\mathbf{U} \odot \bar{\mathbf{U}} \in \Omega_n\), where \(\Omega_n\) denotes the Birkhoff polytope of \(n \times n\) DSMs), and no known classical parametric method achieves the same property
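For completeness, the double stochasticity of \(\mathbf{U} \odot \bar{\mathbf{U}}\) follows in one line from unitarity: every entry \(|U_{ij}|^2\) is non-negative, and

\[
\sum_j \left(\mathbf{U} \odot \bar{\mathbf{U}}\right)_{ij} = \sum_j |U_{ij}|^2 = 1,
\qquad
\sum_i |U_{ij}|^2 = 1,
\]

because the rows and columns of a unitary matrix are orthonormal.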
## Method
### Overall Architecture
QDSFormer replaces the softmax in ViT with the QontOT quantum circuit: the unnormalized attention scores \(\mathbf{QK}^\top\) are fed into the circuit, which returns a doubly stochastic attention matrix. The circuit exploits the fact that the Hadamard product of a unitary matrix with its complex conjugate naturally yields a DSM, enabling flexible DSM generation through parameterized quantum gates. A minimal sketch of where such an operator slots into attention follows.
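To make the swap concrete, here is a minimal sketch (not the authors' code; `normalize` is just a placeholder hook) of single-head attention with a pluggable normalizer:

```python
import numpy as np

def attention(Q, K, V, normalize):
    """Single-head attention with a pluggable normalizer.

    `normalize` is softmax in a standard ViT; in QDSFormer it would be the
    QontOT circuit returning a doubly stochastic matrix.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # unnormalized logits QK^T / sqrt(d)
    return normalize(scores) @ V

def softmax_rows(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)  # row-stochastic, not doubly stochastic
```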
### Key Designs
- **QontOT quantum circuit for DSM generation**:
  - Function: maps unnormalized attention matrices to doubly stochastic matrices
  - Mechanism: for any unitary matrix \(\mathbf{U}\), \(\mathbf{U} \odot \bar{\mathbf{U}} \in \Omega_n\); QontOT sets the control angles of the circuit via the product of the parameters \(\theta\) and the data \(\mathbf{M}\), producing parameterized DSMs
  - Design motivation: the quantum circuit guarantees the DSM property by construction and is parametric (unlike non-parametric Sinkhorn), enabling learning of optimal DSMs
- **Expressivity analysis**:
  - Function: systematic comparison of DSM diversity among QontOT, Sinkhorn, and QR decomposition
  - Mechanism: enumerate inputs over a discretized hypercube and count the number of unique DSMs produced
  - Key result: QontOT (8 layers) produces a unique DSM for each input (approximately injective), while Sinkhorn and QR exhibit significant collision rates
- **QR-decomposition doubly stochastic operator (quantum-inspired)**:
  - Function: serves as a classical, quantum-inspired alternative
  - Mechanism: apply a QR decomposition to \(\mathbf{M}\) to obtain a unitary matrix \(\mathbf{U}\), then compute \(\mathbf{U} \odot \bar{\mathbf{U}}\) (see the sketch after this list)
  - Limitation: \(O(n^3)\) complexity and a high collision rate, though it performs well in certain single-layer ViT settings
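A minimal numpy sketch of this QR-based operator (my own illustration, not the paper's code) shows why the output is doubly stochastic: the rows and columns of the orthogonal factor are orthonormal vectors, so their squared magnitudes sum to 1:

```python
import numpy as np

def qr_doubly_stochastic(M):
    """Quantum-inspired baseline: |U|^2 for the orthogonal factor U of M."""
    U, _ = np.linalg.qr(M)          # O(n^3) orthogonalization
    return np.abs(U) ** 2           # elementwise |U_ij|^2 = (U ⊙ Ū)_ij

rng = np.random.default_rng(0)
D = qr_doubly_stochastic(rng.normal(size=(8, 8)))
print(np.allclose(D.sum(axis=0), 1.0), np.allclose(D.sum(axis=1), 1.0))  # True True
```

One source of the collision rate noted above: distinct inputs sharing the same orthogonal factor (e.g., \(\mathbf{M}\) and \(\mathbf{M}\mathbf{T}\) for any upper-triangular \(\mathbf{T}\) with positive diagonal) map to the same DSM.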
### Loss & Training
- Three circuit-training strategies (a toy contrast of the first two follows this list):
  - Static: uses fixed parameters obtained from quantum-hardware experiments; no training required
  - Mixed: interleaves 200 steps of gradient-free optimization with each training epoch
  - Differentiable: end-to-end joint training (slowest; affected by barren plateaus)
- The static strategy performs best or on par with the optimized variants (potentially because barren plateaus hamper circuit optimization)
- Configuration: 8×8 attention matrices, 16-layer circuits, and 4 ancilla qubits (16 qubits in total)
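A toy sketch of the static vs. gradient-free regimes, under heavy assumptions: `toy_dsm` is a hypothetical stand-in for the circuit (not QontOT), the uniform target is arbitrary, and plain random search stands in for the paper's gradient-free optimizer; the differentiable regime, which would also backpropagate through the circuit, is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_dsm(theta, M):
    """Hypothetical parameterized DSM operator: mix data with parameters,
    orthogonalize, take squared magnitudes."""
    U, _ = np.linalg.qr(M + theta)
    return np.abs(U) ** 2

def loss(theta, M, target):
    return float(((toy_dsm(theta, M) - target) ** 2).sum())

M = rng.normal(size=(4, 4))                 # dummy "attention scores"
target = np.full((4, 4), 0.25)              # uniform DSM as a dummy target

# "Static": parameters stay fixed; nothing is optimized.
theta_static = rng.normal(size=(4, 4))

# Gradient-free refinement (random search), loosely mirroring the
# 200-step-per-epoch "mixed" regime.
theta, best = theta_static.copy(), loss(theta_static, M, target)
for _ in range(200):
    cand = theta + 0.1 * rng.normal(size=theta.shape)
    cand_loss = loss(cand, M, target)
    if cand_loss < best:
        theta, best = cand, cand_loss

print("static:", loss(theta_static, M, target), "refined:", best)
```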
## Key Experimental Results
### Main Results (Table)
Validation accuracy (%, mean ± std) of a 2-layer ViT on FashionMNIST and MNIST:
| Method | FashionMNIST | MNIST |
|---|---|---|
| Softmax | 88.9 ± 0.1 | 98.1 ± 0.3 |
| Softmax_σ² | 84.6 ± 2.1 | 93.0 ± 4.6 |
| QR | 89.3 ± 0.1 | 98.3 ± 0.1 |
| Sinkhorn | 89.1 ± 0.7 | 98.2 ± 0.3 |
| QontOT | 90.0 ± 0.2 | 98.4 ± 0.1 |
MedMNIST (7 datasets): QontOT achieves the best performance on 5 of the 7 datasets.
### Ablation Study
- Circuit depth: performance begins to surpass ViT at 4–8 layers; additional layers yield logarithmic gains, with diminishing returns beyond 16 layers
- Static vs. optimized: static configurations match or outperform end-to-end optimization, likely due to barren plateaus
- Eureka experiment: QDSFormer exhibits earlier Eureka moments (sudden transitions in combinatorial reasoning) and more stable training
- Sinkhorn iterations: the Frobenius distance to the Birkhoff polytope is 0.84 at \(k=3\) and still 0.23 at \(k=21\), whereas QontOT achieves \(< 5 \times 10^{-6}\) (a toy convergence check follows this list)
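As a sanity check on this trend, here is a hedged sketch of standard Sinkhorn normalization (alternating row/column rescaling of \(\exp(\mathbf{M})\)), using the deviation of the row/column marginals as a proxy for distance to the Birkhoff polytope. The printed numbers depend on the input distribution and scale, so they illustrate the slow decay rather than reproduce the paper's values:

```python
import numpy as np

def sinkhorn(M, iters):
    P = np.exp(M)                             # enforce non-negativity
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)     # row normalization
        P /= P.sum(axis=0, keepdims=True)     # column normalization
    return P

def marginal_deviation(P):
    # Euclidean deviation of row/column sums from 1 (proxy for DSM distance)
    r, c = P.sum(axis=1) - 1.0, P.sum(axis=0) - 1.0
    return float(np.sqrt((r ** 2).sum() + (c ** 2).sum()))

rng = np.random.default_rng(0)
M = 4.0 * rng.normal(size=(8, 8))             # sharper logits converge slower
for k in (3, 9, 21):
    print(k, marginal_deviation(sinkhorn(M, k)))
```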
### Key Findings
- Over 43M input matrices, QontOT produces the highest number of unique DSMs, behaving near-injectively (near-zero information loss)
- QontOT naturally exhibits higher attention entropy, mitigating entropy collapse during ViT training
- Sinkhorn maps all row-constant matrices (e.g., \(\mathbf{e}_2\mathbf{1}^\top\) and \(\mathbf{e}_4\mathbf{1}^\top\)) to the same DSM, resulting in information loss (a toy demonstration follows this list)
- Circuit size scales logarithmically as \(O(\log_2(T))\) in the sequence length \(T\), theoretically favorable for long sequences
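The collision is easy to see in a toy sketch (my own illustration): after exponentiation, any row-constant matrix has constant positive rows, so a single row normalization already lands on the uniform DSM \(\frac{1}{n}\mathbf{1}\mathbf{1}^\top\), a fixed point of the remaining Sinkhorn iterations. Hence \(\mathbf{e}_2\mathbf{1}^\top\) and \(\mathbf{e}_4\mathbf{1}^\top\) become indistinguishable:

```python
import numpy as np

n = 8
for i in (2, 4):
    P = np.exp(np.outer(np.eye(n)[i], np.ones(n)))  # exp(e_i 1^T), row-constant
    P /= P.sum(axis=1, keepdims=True)               # one row normalization
    print(np.allclose(P, np.full((n, n), 1.0 / n))) # True for both i: same DSM
```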
## Highlights & Insights
- This is the first work to apply the "DSM inductive bias" from quantum computing to Transformers, opening a design space unreachable by classical parametric methods
- The expressivity analysis is rigorous and comprehensive, covering three dimensions: exhaustive enumeration, information preservation, and entropy analysis
- The quantum-inspired QR decomposition method has independent value as a classical alternative
- The finding that static circuits already outperform classical methods simplifies deployment, eliminating the need for quantum-classical hybrid training
## Limitations & Future Work
- Experiments are limited to small-scale datasets (MNIST-level) and small models (1–4 layer ViT); large-scale validation is absent
- Quantum circuit simulation speed is a bottleneck; efficient execution of attention computation on real quantum hardware is currently infeasible
- Attention matrix sizes must be powers of two (a qubit constraint), requiring padding (a minimal padding sketch follows this list)
- Only encoder self-attention is validated; applicability to decoder and cross-attention remains unexplored
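As an illustration of the power-of-two constraint (a sketch under my own assumptions, e.g., zero-padding; the paper does not specify the padding scheme), one can pad a \(T \times T\) score matrix up to the next power of two and crop or mask afterwards:

```python
import numpy as np

def pad_to_pow2(scores):
    """Zero-pad a T x T score matrix to the next power-of-two size."""
    T = scores.shape[0]
    n = 1 << max(T - 1, 1).bit_length()   # smallest power of two >= T (for T >= 2)
    padded = np.zeros((n, n), dtype=scores.dtype)
    padded[:T, :T] = scores
    return padded, T                      # keep T to crop/mask the output DSM

padded, T = pad_to_pow2(np.ones((10, 10)))
print(padded.shape)  # (16, 16)
```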
## Related Work & Insights
- Sinkformer and ESPFormer/LOTFormer are representative classical doubly stochastic Transformers
- The "\(\mathbf{U} \odot \bar{\mathbf{U}}\) naturally yields a DSM" property of QontOT constitutes a unique quantum inductive bias with no known classical equivalent
- As quantum hardware matures, QDSFormer may demonstrate advantages at larger scales
- This work inspires a new paradigm for incorporating physical constraints into neural networks
## Rating
- ⭐⭐⭐⭐ — Theoretically novel with pioneering contributions at the quantum–ML intersection, but constrained by small-scale validation and the current maturity of quantum hardware