# A Unified Framework for Establishing the Universal Approximation of Transformer-Type Architectures
Conference: NeurIPS 2025 · arXiv: 2506.23551 · Code: None · Area: Transformer Theory / Approximation Theory · Keywords: Universal Approximation Property, Token Distinguishability, Attention Mechanism, Permutation Equivariance, Control Theory
## TL;DR
A unified theoretical framework is established for proving the universal approximation property (UAP) of diverse Transformer architectures. The framework rests on two core conditions — nonlinear affine invariance of the feed-forward layer and token distinguishability of the attention layer — and leverages an analyticity assumption to reduce the latter to verification on only two-sample cases. The framework successfully covers a wide range of practical architectures, including softmax, RBF kernel, Performer, BigBird, Linformer, and others.
## Background & Motivation
Transformers have achieved remarkable success in NLP, CV, and beyond, yet theoretical understanding of their expressive power has largely relied on constructive proofs tailored to specific architectures. Yun et al. (2019) proved the UAP of softmax-attention Transformers, and Zaheer et al. (2020) proved the UAP of BigBird; however, each newly proposed attention variant (kernel-based, sparse, low-rank, etc.) requires a bespoke constructive proof, demanding significant effort with no reusable technical blueprint.
This situation contrasts with UAP theory for ResNets: drawing on a control-theoretic perspective, Li et al. (2024) and Cheng et al. (2025) have shown that deep ResNets with Lipschitz nonlinear activations possess the UAP independently of specific network structure. Directly transferring this approach to Transformers, however, faces a fundamental obstacle — the feed-forward layer in Transformers is token-wise (the same transformation is applied independently to each token) and cannot directly model inter-token interactions. This means the attention mechanism must additionally bear the responsibility of producing distinct contextual representations for different inputs.
The central goal of this paper is to provide a set of sufficient conditions that are independent of the specific attention mechanism, such that any Transformer variant satisfying these conditions automatically possesses the UAP, and such that these conditions can be efficiently verified.
## Method
### Overall Architecture
The Transformer is abstracted as a deep composition of two alternating components: (1) a token-mixing layer \(g \in \mathcal{G}\) (abstracting attention), mapping \(\mathbb{R}^{d \times n}\) to itself to capture inter-token dependencies; and (2) a token-wise feed-forward layer \(h \in \mathcal{H}^{\otimes n}\), applying the same function \(\bar{h}: \mathbb{R}^d \to \mathbb{R}^d\) independently to each token. A Transformer block is defined as \((Id + h) \circ (Id + g)\), and the full model \(\mathcal{T}_{\mathcal{G},\mathcal{H}}\) is an arbitrary-depth composition of such blocks. The notion of \(G\)-UAP is defined on this abstraction to support general permutation subgroups \(G \leq S_n\).
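This abstraction can be sketched as a toy numeric example (not the paper's construction): a block \((Id + h) \circ (Id + g)\) with plain dot-product softmax attention standing in for \(g\) and a bare token-wise ReLU map standing in for \(h\); all concrete choices here are hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(s - m) for s in v]
    z = sum(e)
    return [s / z for s in e]

def g_attention(X):
    """Token-mixing layer g: dot-product softmax attention with identity
    Q/K/V projections (a toy, hypothetical choice)."""
    n, d = len(X), len(X[0])
    scores = [[sum(q * k for q, k in zip(X[i], X[j])) for j in range(n)]
              for i in range(n)]
    A = [softmax(row) for row in scores]
    return [[sum(A[i][j] * X[j][c] for j in range(n)) for c in range(d)]
            for i in range(n)]

def h_tokenwise(X):
    """Token-wise feed-forward layer h: the same map (here a bare ReLU
    as a stand-in) applied to every token independently."""
    return [[max(0.0, x) for x in tok] for tok in X]

def block(X):
    """One abstract Transformer block (Id + h) composed with (Id + g)."""
    Y = [[x + gx for x, gx in zip(tok, gtok)]
         for tok, gtok in zip(X, g_attention(X))]
    return [[y + hy for y, hy in zip(tok, htok)]
            for tok, htok in zip(Y, h_tokenwise(Y))]

X = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]   # n = 3 tokens in R^2
Y = block(X)
```

Because \(g\) treats all token pairs symmetrically and \(h\) is token-wise, the block commutes with token permutations, which is the \(S_n\) structure behind the \(S_n\)-UAP.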
### Key Designs
- Feed-Forward Layer Condition — Nonlinear Affine Invariance (Definition 2):
  - Function: Ensures that deep compositions of feed-forward layers can approximate any continuous function on a single token.
  - Mechanism: Requires \(\mathcal{H}\) to be closed under affine transformations (for all \(h \in \mathcal{H}\), \(W, A \in \mathbb{R}^{d \times d}\), \(b \in \mathbb{R}^d\), the map \(x \mapsto W h(Ax - b)\) lies in \(\mathcal{H}\)) and to contain at least one non-affine Lipschitz function.
  - Design Motivation: This is an extremely mild condition — nearly all practical feed-forward networks (with arbitrary activations and widths) satisfy it. Prior work (Cheng et al., 2025) has shown it is equivalent to the UAP of deep ResNets.
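The closure requirement can be checked concretely for a one-hidden-layer ReLU family: wrapping a member \(h\) as \(W h(Ax - b)\) yields another network of the same architecture once the affine maps are absorbed into the weights. A minimal numeric sketch with hypothetical weights:

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def relu(v):
    return [max(0.0, x) for x in v]

def mlp(W1, b1, W2, b2, x):
    """Family member h(x) = W2 relu(W1 x + b1) + b2."""
    hid = relu([s + c for s, c in zip(matvec(W1, x), b1)])
    return [s + c for s, c in zip(matvec(W2, hid), b2)]

# hypothetical weights for one member h and an affine wrapper W h(Ax - b)
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.1, -0.3]
W2, b2 = [[2.0, 0.0], [-1.0, 1.5]], [0.0, 0.2]
W = [[0.5, 1.0], [1.0, -0.5]]
A = [[2.0, 1.0], [0.0, 1.0]]
b = [0.3, -0.1]

def wrapped(x):
    """x -> W h(Ax - b), the map Definition 2 requires to stay in H."""
    z = [s - c for s, c in zip(matvec(A, x), b)]
    return matvec(W, mlp(W1, b1, W2, b2, z))

# absorb the affine maps into the weights: same architecture, so the
# wrapped function is again a member of the family
W1p = matmul(W1, A)
b1p = [c - s for c, s in zip(b1, matvec(W1, b))]
W2p = matmul(W, W2)
b2p = matvec(W, b2)

def absorbed(x):
    return mlp(W1p, b1p, W2p, b2p, x)
```

The two functions agree everywhere, which is why closure holds for essentially any activation and width.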
- Attention Layer Condition — Token Distinguishability (Definition 3):
  - Function: Ensures the attention mechanism produces mutually distinct contextual representations for all tokens across different inputs.
  - Mechanism: For any finite dataset \(D = \{X_i\}_{i=1}^N\) in general position, there exists a finite-depth composition \(g \in (Id + \mathcal{G})^m\) such that samples \(X_i, X_j\) from distinct \(G\)-orbits have all tokens mapped to mutually distinct values.
  - Design Motivation: If the attention mechanism cannot distinguish tokens, then the output is constant on a set of positive measure (since the feed-forward layer is token-wise), and the UAP necessarily fails. The stated condition is thus a mild relaxation of a necessary condition that remains practically checkable as a sufficient one.
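A toy numeric check of this definition, assuming a single residual softmax-attention step on scalar tokens (the samples, the key scaling, and the identity value map are all illustrative choices, not the paper's):

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(s - m) for s in v]
    z = sum(e)
    return [s / z for s in e]

def attn_step(X, t=4.0):
    """One residual softmax-attention step (Id + g) on scalar tokens,
    with key scaling t and identity value map (a toy choice)."""
    out = []
    for xi in X:
        w = softmax([t * xi * xj for xj in X])
        out.append(xi + sum(wj * xj for wj, xj in zip(w, X)))
    return out

# two samples in general position (no coinciding tokens across samples)
X1 = [0.2, 1.0]
X2 = [0.5, -0.7]

# map both samples and collect every token image
Y = attn_step(X1) + attn_step(X2)

# token distinguishability (toy check): all mapped tokens pairwise distinct
distinct = all(abs(Y[i] - Y[j]) > 1e-6
               for i in range(len(Y)) for j in range(i + 1, len(Y)))
```

Once all token images are distinct, the token-wise feed-forward layers can send them wherever the target function requires, which is the mechanism behind Theorem 1.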
- Analyticity Simplification (Theorem 2, Core Technical Contribution):
  - Function: Reduces verification of token distinguishability from "for any finite set" to "for two samples."
  - Mechanism: If \(\mathcal{G}\) is analytic with respect to its parameters \(\theta\), then failure of token distinguishability for some finite set with \(|D| > 2\) implies that a certain analytic function vanishes identically on parameter space. Since the zero set of a nontrivial analytic function has measure zero, verification reduces to the case \(|D| = 2\).
  - Design Motivation: Direct verification of token distinguishability requires checking all finite sets (infinitely many). The analyticity assumption reduces the complexity to checking a single pair of samples, greatly lowering the verification burden.
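The measure-zero argument can be illustrated numerically: a separation function built from pairwise differences of attended tokens is analytic in the attention parameter, and random parameter draws essentially never land on its zero set. A sketch under the same toy scalar-attention assumptions as above (not the paper's actual construction):

```python
import math
import random

def softmax(v):
    m = max(v)
    e = [math.exp(s - m) for s in v]
    z = sum(e)
    return [s / z for s in e]

def attn_step(X, theta):
    """One residual softmax-attention step on scalar tokens; the score
    scaling theta plays the role of the attention parameters."""
    out = []
    for xi in X:
        w = softmax([theta * xi * xj for xj in X])
        out.append(xi + sum(wj * xj for wj, xj in zip(w, X)))
    return out

def separation(theta, X1, X2):
    """Product of pairwise differences of all mapped tokens: analytic in
    theta, and zero exactly when two token images coincide."""
    Y = attn_step(X1, theta) + attn_step(X2, theta)
    p = 1.0
    for i in range(len(Y)):
        for j in range(i + 1, len(Y)):
            p *= Y[i] - Y[j]
    return p

random.seed(0)
X1, X2 = [0.2, 1.0], [0.5, -0.7]

# the zero set of a nontrivial analytic function has measure zero, so
# random parameter draws should (almost surely) never make separation vanish
zeros = sum(1 for _ in range(1000)
            if separation(random.uniform(-5.0, 5.0), X1, X2) == 0.0)
```

This is exactly the leverage Theorem 2 provides: exhibiting one parameter value where separation is nonzero already rules out identical vanishing, so almost every parameter choice distinguishes the tokens.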
## Loss & Training
This is a purely theoretical work with no training procedure. The core theorem (Theorem 1) states: if \(\mathcal{H}\) satisfies nonlinear affine invariance and \(\mathcal{G}\) satisfies token distinguishability, then \(\mathcal{T}_{\mathcal{G},\mathcal{H}}\) possesses the \(G\)-UAP. An additional corollary states: when token distinguishability can be achieved with a fixed number \(m\) of attention layers, those \(m\) attention layers plus arbitrarily many feed-forward layers suffice to realize the UAP.
## Key Experimental Results
### Main Results
As a purely theoretical paper, the core contribution lies in applying the unified conditions to various practical architectures and verifying their UAP:
| Architecture | Attention Type | Kernel / Sparsity Pattern | UAP Type | Required Attention Layers | Prior Result |
|---|---|---|---|---|---|
| Original Transformer | Softmax kernel | \(k(x,y) = e^{x^\top y}\) | \(S_n\)-UAP | 1 layer | Known (Yun 2019) |
| Performer | Random feature kernel | \(k(x,y) = \phi(x)^\top\phi(y)\) | \(S_n\)-UAP (a.s.) | 1 layer | Known (Alberti 2023) |
| RBF Attention | Gaussian kernel | \(k(x,y) = e^{-\gamma\|x-y\|^2}\) | \(S_n\)-UAP | 1 layer | New result |
| BigBird | Sparse softmax | Sliding window + global + random | \(G\)-UAP | Connectivity hop count | Relaxed conditions |
| Longformer | Sliding window | \(\mathcal{N}(i) = \{\,j : \lvert j - i \rvert \leq w\,\}\) | Reflection group-UAP | | |
| Linformer | Low-rank projection | \(EF\) projection matrix | UAP (no symmetry) | 1 layer | New result |
| SkyFormer | Nyström kernel | Approximate Gaussian kernel | \(S_n\)-UAP | 1 layer | New result |
### Ablation Study
| Configuration | Key Metric | Description |
|---|---|---|
| Kernel limiting behavior | \(\lim_{t\to\infty} k(x,tW_Ky_1)/k(x,tW_Ky_2)\) | Requires divergence or convergence to zero to distinguish different tokens |
| Sparse attention connectivity | Graph connectivity | UAP is satisfied as long as tokens can reach each other in finite hops |
| Fixed attention layer count | \(m\) layers | Kernel-based attention requires only 1 layer; sparse attention requires "connectivity hop count" layers |
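The kernel-limit criterion from the table can be made concrete for the softmax kernel \(k(x,y) = e^{x^\top y}\). In a scalar toy case (key matrix \(W_K\) absorbed into the keys, values chosen for illustration), the log of the ratio grows linearly in the key scale \(t\), so the ratio itself diverges or vanishes, and softmax attention hardens toward argmax:

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(s - m) for s in v]
    z = sum(e)
    return [s / z for s in e]

def log_ratio(x, y1, y2, t):
    """log of k(x, t*y1) / k(x, t*y2) for the softmax kernel
    k(x, y) = exp(x * y); scalar toy case."""
    return t * x * (y1 - y2)

x, y1, y2 = 1.0, 0.75, 0.25

# the log-ratio grows linearly in t, so the ratio diverges (or vanishes
# for the opposite sign): the distinguishability criterion is met
growth = [log_ratio(x, y1, y2, t) for t in (1.0, 10.0, 100.0)]

# consequence: at large t, attention over the two keys is nearly one-hot
hard = softmax([100.0 * x * y1, 100.0 * x * y2])
```

This is the "hard attention" limit: scaling the keys lets the kernel pick out a single token, which is what makes one kernel-attention layer enough.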
### Key Findings
- The UAP condition for sparse attention reduces elegantly to graph connectivity — no technical conditions such as periodicity, Hamiltonian paths, or star substructures are required.
- For kernel-based attention, the UAP condition reduces to the limiting behavior of the kernel under key scaling — a natural "hard attention" approximation.
- One attention layer plus arbitrarily many feed-forward layers suffices for the UAP of kernel-based attention, quantifying the "memory capacity" of attention.
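The connectivity criterion is easy to check mechanically: build the sparsity graph and compute its diameter by breadth-first search, which bounds how many attention layers are needed for any token pair to exchange information. A sketch with a hypothetical BigBird-style pattern (sliding window plus one global token; the pattern details are illustrative):

```python
from collections import deque

def neighbors(i, n, w=1, global_tokens=(0,)):
    """Hypothetical BigBird-style sparsity pattern: a sliding window of
    radius w plus global tokens that attend to (and are attended by) all."""
    nbrs = set(range(max(0, i - w), min(n, i + w + 1)))
    nbrs |= set(global_tokens)
    if i in global_tokens:
        nbrs = set(range(n))
    return nbrs

def hop_count(n, **kw):
    """Diameter of the attention graph, or None if it is disconnected
    (in which case the UAP fails). Connectivity alone suffices for the
    UAP; the hop count bounds the required number of attention layers."""
    diam = 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(u, n, **kw):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        if len(dist) < n:
            return None
        diam = max(diam, max(dist.values()))
    return diam
```

With one global token, every token pair is linked within two hops regardless of sequence length, whereas a bare radius-1 window on \(n\) tokens needs \(n - 1\) hops, illustrating how aggressive sparsity can be while keeping the UAP.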
## Highlights & Insights
- Exceptionally clean theoretical contribution: Two independently verifiable conditions plus an analyticity simplification technique cover nearly all mainstream Transformer variants, avoiding the tedium of architecture-specific constructions.
- Power of the non-constructive approach: There is no need to explicitly construct network parameters approximating target functions; verifying the conditions alone guarantees the existence of the UAP.
- "Token distinguishability" as a design-guiding concept: Any new attention mechanism that can distinguish all tokens of any two inputs within finitely many layers automatically inherits a UAP guarantee.
- Theoretical lower bound for sparse attention: The graph need only be connected (not complete), providing a theoretical guarantee for designing efficient sparse patterns — sparsity can be very aggressive as long as connectivity is maintained.
- New symmetry-preserving architectures: Attention designs preserving equivariance under the dihedral group \(D_n\) and cyclic group \(C_n\) are proposed, which are valuable for applications such as molecular and crystal structure prediction.
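As a minimal illustration of the cyclic case (a stand-in for, not a reproduction of, the paper's equivariant designs): a token mixer whose weights depend only on the index difference \((j - i) \bmod n\) commutes with cyclic shifts, i.e. is \(C_n\)-equivariant.

```python
def circ_mix(X, kernel):
    """Token mixer with circulant weights:
    out[i] = sum_j kernel[(j - i) % n] * X[j].
    Position-based weights of this form commute with cyclic shifts."""
    n = len(X)
    return [sum(kernel[(j - i) % n] * X[j] for j in range(n))
            for i in range(n)]

def shift(Z):
    """Cyclic shift by one position (a generator of C_n)."""
    return Z[1:] + Z[:1]

X = [0.3, -1.2, 0.7, 2.0]          # scalar tokens, n = 4
kernel = [0.5, 0.3, 0.0, 0.2]      # hypothetical circulant weights

# C_n equivariance: mixing then shifting equals shifting then mixing
left = circ_mix(shift(X), kernel)
right = shift(circ_mix(X, kernel))
```

The same index-difference trick extends to the dihedral group \(D_n\) by additionally making the weights symmetric in \(\pm(j - i)\), which is the kind of structure molecular and crystal applications need.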
## Limitations & Future Work
- The effect of normalization layers such as LayerNorm, widely used in practice, on the UAP is not considered.
- For certain attention mechanisms (e.g., linear attention \(k(x,y)=\phi(x)^\top\phi(y)\) where \(\phi\) is ReLU), the analyticity assumption does not hold, so Theorem 2 does not apply (though Theorem 1 can still be verified directly).
- The framework does not provide quantitative approximation rates — it establishes that approximation is possible but not how fast.
- Only fixed-length, sequence-to-sequence mappings over compact sets are addressed; variable-length inputs and autoregressive generation settings are not covered.
- The individual contributions of complex components such as multi-head attention and MoE to approximation efficiency are not analyzed.
## Related Work & Insights
- vs. the constructive softmax UAP of Yun et al. (2019): The present result subsumes theirs as a special case and does not require a bias term in the query.
- vs. the sparse Transformer UAP of Zaheer et al. (2020) / Yun et al. (2020): Their results require technical conditions such as periodicity or star subgraphs; this paper requires only graph connectivity.
- vs. the single-layer attention memory capacity of Kajitsuka & Sato (2024): The present work generalizes to kernel-based and sparse attention.
- vs. the UAP for symmetric functions of Li et al. (2024): This paper extends results to non-transitive permutation groups and \(d\)-dimensional tokens (rather than 1-dimensional coordinates).
- Implications for architecture innovation: When designing new attention mechanisms, verifying the conditions of Theorem 1/2 suffices to obtain a UAP guarantee — providing a new development paradigm of "verify theoretical feasibility before engineering implementation."
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — Introduces control-theoretic and analyticity techniques into Transformer UAP analysis, establishing an elegant unified framework.
- Experimental Thoroughness: ⭐⭐⭐ — A purely theoretical paper; the generality of the framework is well demonstrated through corollaries for diverse architectures.
- Writing Quality: ⭐⭐⭐⭐⭐ — The Definition→Theorem→Corollary structure is clearly organized, with well-placed intuitive explanations.
- Value: ⭐⭐⭐⭐ — Provides theoretical foundations and UAP-guarantee design principles for Transformer architecture design.