# Exact Expressive Power of Transformers with Padding

**Conference:** NeurIPS 2025 | **arXiv:** 2505.18948 | **Code:** None | **Area:** LLM Reasoning | **Keywords:** Transformer expressive power, padding, looping, circuit complexity, \(\mathsf{TC}^0\)
## TL;DR

This paper gives an exact characterization of the expressive power of Transformers with padding: fixed depth plus polynomial padding captures precisely \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\); adding \(O(\log^d n)\) looping captures precisely \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\); and polylogarithmic looping reaches \(\mathsf{FO}\)-uniform \(\mathsf{NC}\). These results establish a complete theoretical foundation for padding and looping as parallelism-preserving inference-time computation mechanisms.
## Background & Motivation
Background: Transformers have inherent computational limitations — fixed-depth Transformers are confined to \(\mathsf{TC}^0\), the class of highly parallelizable problems. Chain-of-Thought (CoT) reasoning can extend beyond \(\mathsf{TC}^0\), but at the cost of sequential decoding, sacrificing parallelism.
Limitations of Prior Work: Although CoT is expressively powerful, it is slow at inference time. The question remains whether there exist inference-time computation methods that extend the expressive power of Transformers without increasing parameters and while preserving parallelism.
Key Challenge: It was previously known that padded Transformers have an upper bound of \(\mathsf{TC}^0\), but whether the entire class \(\mathsf{TC}^0\) is achievable (i.e., whether the lower bound matches) remained an open problem.
Goal: (a) Precisely characterize the expressive power of padded Transformers; (b) Understand how the combination of padding and looping systematically extends Transformer capabilities.
Key Insight: Rather than following the standard \(\mathsf{FO}[\mathsf{M}, \mathsf{bit}]\) route (where the bit predicate is difficult to simulate in Transformers), this work employs \(\mathsf{FO}+\mathsf{M}^2\) — first-order logic with pairwise majority quantifiers — which is equivalent to \(\mathsf{TC}^0\) but does not require bit.
Core Idea: A set of \(n^k\) padding tokens provides the Transformer with sufficient "storage space" to enumerate all \(n^k\) assignments of \(k\) variables, thereby enabling the resolution of arbitrary \(\mathsf{FO}+\mathsf{M}^2\) formulas.
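As a toy illustration of this enumeration idea (a sketch under simplifying assumptions, not the paper's actual construction; the names `exists_i` and `pairwise_majority` are ours), the following Python enumerates all \(n^k\) assignments as "padding tokens" and evaluates the two quantifiers by comparing an attention-style average against a threshold, mirroring the \(1/(2n)\) and \(1/2\) comparisons used in the paper's Lemma 2:

```python
# Toy sketch (illustrative, not the paper's construction): n^k "padding
# tokens" enumerate all assignments of k variables; quantifiers reduce to
# comparing an average of stored truth values against a fixed threshold.
from itertools import product

n, k = 5, 2                                        # positions, variables
assignments = list(product(range(n), repeat=k))    # one per padding token

def exists_i(P, v, i):
    """Existential quantifier: average P over the n tokens v' agreeing
    with v off variable i; the mean c/n exceeds 1/(2n) iff c >= 1."""
    c = sum(P[v[:i] + (x,) + v[i+1:]] for x in range(n))
    return c / n > 1 / (2 * n)

def pairwise_majority(P, v, i, j):
    """M^2 quantifier: average over the n^2 tokens v' agreeing with v
    off variables i and j; the mean c/n^2 exceeds 1/2 iff a strict
    majority of the (i, j)-assignments satisfy P."""
    c = 0
    for x in range(n):
        for y in range(n):
            w = list(v)
            w[i], w[j] = x, y
            c += P[tuple(w)]
    return c / n**2 > 1 / 2

# Each token v stores the truth value of some already-evaluated
# subformula under assignment v; here, "v[0] + v[1] is even".
P = {v: (v[0] + v[1]) % 2 == 0 for v in assignments}
```

Note that both quantifiers are evaluated in a single parallel step over the padding tokens, which is why padding preserves parallelism where CoT does not.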
## Method

### Overall Architecture

This is a purely theoretical work. The central contributions are two exact characterization theorems:

- Theorem 1: \(\mathsf{AHAT}^0_* = \mathsf{FO}\text{-uniform } \mathsf{TC}^0\) (fixed depth + polynomial padding)
- Theorem 2: \(\mathsf{AHAT}^d_* = \mathsf{FO}\text{-uniform } \mathsf{TC}^d\) (\(O(\log^d n)\) looping + polynomial padding)
The computational model is the Averaging-Hard-Attention Transformer (AHAT): hard attention (attending only to positions achieving the maximum attention score), masked pre-norm, and causal masking.
### Key Designs
- Padding → Variable Enumeration Storage (Lemma 2):
    - Function: Proves that \(\mathsf{FO}+\mathsf{M}^2\) formulas with \(k\) distinct variables and nesting depth \(\ell\) are computable in \(\mathsf{uAHAT}^0_k\).
    - Mechanism: The \(n^k\) padding tokens enumerate all \(n^k\) assignments of \(k\) variables. Each padding token \(v\) stores the truth values of subformulas under assignment \(v\), and by induction each layer evaluates one level of the formula:
        - Constants/variables: positions are retrieved via the layer-norm hash \(\phi(z) = \langle z,1,-z,-1\rangle / \sqrt{2z^2+2}\)
        - Existential quantifier \(\exists i.\, P\): token \(v\) attends to all \(v'\) agreeing with \(v\) except on variable \(i\); the attention mean \(c/n\) (where \(c\) counts satisfying assignments) is compared against \(1/(2n)\)
        - Pairwise majority quantifier \(\mathsf{M}^2(i,j).\, P\): token \(v\) attends to all \(v'\) agreeing with \(v\) on all variables except \(i, j\); the mean \(c/n^2\) is compared against \(1/2\)
    - Design Motivation: Since \(\mathsf{FO}+\mathsf{M}^2\) is equivalent to \(\mathsf{FO}\text{-uniform } \mathsf{TC}^0\), showing that padding can simulate every operator of this logic establishes the matching lower bound.
- Reduction Framework for Transformer Analysis (Lemma 3):
    - Function: Brings the classical complexity-theoretic tools of reductions and complete problems into the analysis of Transformers.
    - Mechanism: If a class \(\mathsf{C}\) has a complete problem \(L\) under \(\mathsf{R}\)-reductions, and padded Transformers can recognize \(L\) and compute all \(\mathsf{R}\)-reductions, then \(\mathsf{C} \subseteq \mathsf{AHAT}^d_*\).
    - Design Motivation: This yields a modular proof strategy: to obtain recognizability for an entire class, it suffices to show that a Transformer can solve one complete problem and implement the associated reductions.
- Looping for Depth Extension (Lemma 5):
    - Function: Proves that \(O(\log^d n)\) looping combined with polynomial padding suffices to recognize all of \(\mathsf{FO}\text{-uniform } \mathsf{TC}^d\).
    - Mechanism: Graph connectivity is \(\mathsf{NL}\)-complete under \(\mathsf{FO}\) reductions, and log-looped padded Transformers can solve graph connectivity (a known result); it follows that padded Transformers can implement \(\mathsf{L}\)-reductions. The completeness of wide-\(\mathsf{TC}^d\) circuit evaluation under \(\mathsf{L}\)-reductions then completes the chain of inclusions.
    - Design Motivation: Formalizes the intuition that padding controls circuit width while looping controls circuit depth.
- Uniformity Collapse Theorem (Theorem 3):
    - Function: Proves that for \(d \geq 1\), \(\mathsf{FO}\text{-uniform } \mathsf{TC}^d = \mathsf{L}\text{-uniform } \mathsf{TC}^d\).
    - Mechanism: Since \(\mathsf{L}\) itself lies within \(\mathsf{FO}\text{-uniform } \mathsf{AC}^d\), the \(\mathsf{L}\)-computation used to construct the circuit can be simulated by \(\mathsf{FO}\)-uniform circuits and composed with the evaluation circuit.
    - Design Motivation: A new circuit-complexity result obtained as a by-product; it strengthens Theorem 2.
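The layer-norm hash \(\phi(z)\) from Lemma 2 above has a property worth spelling out: \(\phi(z)\) is a unit vector, and \(\langle \phi(z), \phi(z') \rangle = (zz'+1)/\sqrt{(z^2+1)(z'^2+1)}\), which by Cauchy–Schwarz is uniquely maximized at \(z' = z\). Hard attention with query \(\phi(z)\) therefore retrieves exactly the position storing \(z\). A quick numerical check (our own sketch, not code from the paper):

```python
# Numerical check of the layer-norm hash phi(z) = <z, 1, -z, -1> / sqrt(2z^2 + 2):
# phi(z) has unit norm, and <phi(z), phi(z')> is uniquely maximized at z' = z,
# so argmax (hard) attention with query phi(z) retrieves the position storing z.
import math

def phi(z):
    norm = math.sqrt(2 * z * z + 2)
    return [z / norm, 1 / norm, -z / norm, -1 / norm]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

stored = [0.0, 1.0, 2.0, 3.0, 4.0]      # values held at positions 0..4
keys = [phi(z) for z in stored]

# Unit norm by construction, so normalization leaves phi(z) in place.
assert all(abs(dot(key, key) - 1.0) < 1e-12 for key in keys)

# Hard attention: the argmax score picks out exactly the matching position.
for q, z in enumerate(stored):
    scores = [dot(phi(z), key) for key in keys]
    assert max(range(len(scores)), key=scores.__getitem__) == q
```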
### Loss & Training
Not applicable — this is a purely theoretical work with no training component.
## Key Experimental Results

### Main Results
This paper contains no experiments. The central results are mathematical theorems:
| Theorem | Model | Equivalent Complexity Class |
|---|---|---|
| Thm 1 | Fixed depth + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\) |
| Thm 2 | \(O(\log^d n)\) loop + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\) |
| Cor 2.1 | Polylog loop + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{NC}\) |
| Thm 3 | Circuit complexity (\(d \geq 1\)) | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d = \mathsf{L}\)-uniform \(\mathsf{TC}^d\) |
### Key Comparisons
| Inference-Time Computation | Expressive Power | Parallelism | Parameter Increase |
|---|---|---|---|
| None | \(\mathsf{TC}^0\) | Fully parallel | None |
| Poly padding | \(\mathsf{TC}^0\) (fills the entire class) | Fully parallel | None |
| Padding + log loop | \(\mathsf{TC}^1 \supseteq \mathsf{NL}\) | Highly parallel | None |
| Padding + polylog loop | \(\mathsf{NC}\) | Highly parallel | None |
| Chain of Thought | \(\supseteq \mathsf{P}\) | Sequential | None |
### Key Findings
- Padding alone does not raise the complexity-class upper bound of Transformers (which remains \(\mathsf{TC}^0\)), but enables them to fill the entire class — previously only a proper subset was known to be reachable.
- Logarithmic precision and polynomial precision are equivalent in the padded/looped setting.
- Causal masking and unmasked Transformers are equivalent under padding (Proposition 1).
- Padding combined with polylog looping reaches \(\mathsf{NC}\), which is the theoretical upper bound for parallelism-preserving computation (unless \(\mathsf{NC} = \mathsf{P}\)).
## Highlights & Insights

- First exact characterization of the expressive power of padded Transformers, resolving an open problem. The key technical insight is to bypass the hard-to-simulate \(\mathsf{bit}\) predicate by instead working with the equivalent \(\mathsf{FO}+\mathsf{M}^2\) logic.
- Introducing reduction and completeness tools into Transformer theory is an elegant methodological contribution: one complete problem plus reduction capability yields the entire complexity class. This methodology can be further applied to analyze new Transformer variants.
- The dual perspective of padding as "computational width" and looping as "computational depth" provides a clear conceptual framework for designing inference-time computation strategies.
- The uniformity collapse theorem is a novel circuit complexity result with independent value.
## Limitations & Future Work
- The AHAT model (hard attention + masked pre-norm) is an idealization of standard soft-attention Transformers; the expressive power of practical models may differ.
- While the theory shows that padding is effective, how to train models to "learn to exploit padding" in practice remains an open question (existing empirical results are mixed).
- The trainability of looping in practice is questionable — appropriate parameterization and training strategies have yet to be resolved.
- The memory overhead arising from the increased context length induced by padding is not discussed.
- A fine-grained characterization of \(\mathsf{AHAT}^d_k\) for fixed \(k\) (i.e., with bounded padding) remains open.
## Related Work & Insights
- vs. Merrill & Sabharwal (2023): The prior work established the upper bound \(\mathsf{AHAT} \subseteq \mathsf{TC}^0\); this paper completes the picture by establishing the matching lower bound.
- vs. Pfau (2024): That work posed the open question of padded Transformers vs. \(\mathsf{TC}^0\); this paper provides a complete answer.
- vs. CoT theory: CoT can reach \(\mathsf{P}\) and beyond, but only by sacrificing parallelism; padding combined with looping achieves \(\mathsf{NC}\) while preserving it.
- vs. Merrill (2025): The result on looped Transformers solving graph connectivity is used as a key lemma in this work.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ Resolves an important open problem; introducing reduction/completeness tools into Transformer theory represents a methodological innovation.
- Experimental Thoroughness: ⭐⭐⭐ Purely theoretical with no experiments, but the theorem system is complete and self-consistent.
- Writing Quality: ⭐⭐⭐⭐⭐ The logical chain is clear, theorem statements are precise, and proof intuitions are well articulated.
- Value: ⭐⭐⭐⭐ Carries significant implications for understanding the computational power of Transformers and for designing inference-time computation strategies.