
Exact Expressive Power of Transformers with Padding

Conference: NeurIPS 2025 | arXiv: 2505.18948 | Code: None | Area: LLM Reasoning | Keywords: Transformer expressive power, padding, looping, circuit complexity, TC0

TL;DR

This paper provides an exact characterization of the expressive power of Transformers with padding: fixed depth combined with polynomial padding is exactly \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\); adding \(O(\log^d n)\) looping yields exactly \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\); and with polylogarithmic looping the padded model captures exactly \(\mathsf{FO}\)-uniform \(\mathsf{NC}\). These results establish a complete theoretical foundation for padding and looping as parallel inference-time computation mechanisms.

Background & Motivation

Background: Transformers have inherent computational limitations — fixed-depth Transformers are confined to \(\mathsf{TC}^0\), the class of highly parallelizable problems. Chain-of-Thought (CoT) reasoning can extend beyond \(\mathsf{TC}^0\), but at the cost of sequential decoding, sacrificing parallelism.

Limitations of Prior Work: Although CoT is expressively powerful, it is slow at inference time. The question remains whether there exist inference-time computation methods that extend the expressive power of Transformers without increasing parameters and while preserving parallelism.

Key Challenge: It was previously known that padded Transformers have an upper bound of \(\mathsf{TC}^0\), but whether the entire class \(\mathsf{TC}^0\) is achievable (i.e., whether the lower bound matches) remained an open problem.

Goal: (a) Precisely characterize the expressive power of padded Transformers; (b) Understand how the combination of padding and looping systematically extends Transformer capabilities.

Key Insight: Rather than following the standard \(\mathsf{FO}[\mathsf{M}, \mathsf{bit}]\) route (where the bit predicate is difficult to simulate in Transformers), this work employs \(\mathsf{FO}+\mathsf{M}^2\) — first-order logic with pairwise majority quantifiers — which is equivalent to \(\mathsf{TC}^0\) but does not require bit.
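
As a hedged illustration (not an example taken from the paper), under the pairwise-majority semantics described later, where \(\mathsf{M}^2(i,j).P\) holds iff at least half of all pairs \((i,j) \in [n]^2\) satisfy \(P\), the property "token \(a\) occupies at least half of the positions" can be written as \(\mathsf{M}^2(i,j).\,Q_a(i)\): since the formula ignores \(j\), the fraction of satisfying pairs equals the fraction of positions carrying \(a\).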

Core Idea: A set of \(n^k\) padding tokens provides the Transformer with sufficient "storage space" to enumerate all \(n^k\) assignments of \(k\) variables, thereby enabling the resolution of arbitrary \(\mathsf{FO}+\mathsf{M}^2\) formulas.
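
To make the enumeration concrete, here is a minimal Python sketch (an illustration of the idea only, not the paper's construction; the function names and the toy formula are invented for this note): each of the \(n^k\) padding tokens is identified with one assignment of the \(k\) variables, and a quantifier is evaluated by averaging the stored truth values over the assignments that agree with the current one on all remaining variables, mirroring the averaging-attention mechanism described in the Method section below.

```python
from itertools import product

def eval_exists(truth, v, i, n):
    """Evaluate (exists x_i. P) at assignment v: average P over the n
    assignments agreeing with v on every variable except x_i.  The mean is
    c/n for c satisfying completions; comparing it against 1/(2n) decides
    whether a witness exists, mirroring averaging attention."""
    vals = [truth[v[:i] + (a,) + v[i + 1:]] for a in range(n)]
    return (sum(vals) / n) > 1.0 / (2 * n)

def eval_pair_majority(truth, v, i, j, n):
    """Evaluate (M^2(x_i, x_j). P) at assignment v: average P over the n^2
    assignments agreeing with v outside {x_i, x_j}; the mean c/n^2 is
    compared against 1/2."""
    total = 0
    for a, b in product(range(n), repeat=2):
        w = list(v)
        w[i], w[j] = a, b
        total += truth[tuple(w)]
    return total / n**2 >= 0.5

# Toy usage: n = 4 positions, k = 2 variables, hence n^k = 16 "padding
# tokens", one per assignment (x_0, x_1).  P(x_0, x_1) := (x_0 <= x_1).
n, k = 4, 2
truth = {v: int(v[0] <= v[1]) for v in product(range(n), repeat=k)}
print(eval_exists(truth, (3, 0), i=1, n=n))        # exists x_1: 3 <= x_1 -> True
print(eval_pair_majority(truth, (0, 0), 0, 1, n))  # majority of pairs satisfy x_0 <= x_1 -> True
```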

Method

Overall Architecture

This is a purely theoretical work. The central contributions are two exact characterization theorems:

  • Theorem 1: \(\mathsf{AHAT}^0_* = \mathsf{FO}\text{-uniform } \mathsf{TC}^0\) (fixed depth + polynomial padding)
  • Theorem 2: \(\mathsf{AHAT}^d_* = \mathsf{FO}\text{-uniform } \mathsf{TC}^d\) (\(O(\log^d n)\) looping + polynomial padding)

The computational model is the Averaging-Hard-Attention Transformer (AHAT): averaging hard attention (attention mass is split uniformly over the positions achieving the maximum attention score), masked pre-norm, and causal masking.
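
For intuition, a minimal numpy sketch of the averaging-hard-attention step (an idealized reading of the model class written for this note, not the paper's formal definition):

```python
import numpy as np

def averaging_hard_attention(scores, values, causal=True):
    """For each query position, uniformly average the value vectors of the
    key positions attaining the maximum attention score (ties share the
    attention mass equally).  With causal masking, only positions <= the
    query position are eligible."""
    n = values.shape[0]
    out = np.zeros_like(values)
    for q in range(n):
        limit = q + 1 if causal else n
        s = scores[q, :limit]
        winners = np.flatnonzero(s == s.max())   # all positions attaining the max
        out[q] = values[winners].mean(axis=0)    # uniform average over them
    return out

# Toy usage with arbitrary scores and values.
rng = np.random.default_rng(0)
scores = rng.integers(0, 3, size=(5, 5)).astype(float)
values = rng.normal(size=(5, 4))
print(averaging_hard_attention(scores, values).shape)  # (5, 4)
```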

Key Designs

  1. Padding → Variable Enumeration Storage (Lemma 2):

    • Function: Proves that \(\mathsf{FO}+\mathsf{M}^2\) formulas with \(k\) distinct variables and nesting depth \(\ell\) are computable in \(\mathsf{uAHAT}^0_k\).
    • Mechanism: The \(n^k\) padding tokens enumerate all \(n^k\) assignments of \(k\) variables. Each token \(v\) stores the truth values of subformulas under assignment \(v\). By induction, each layer evaluates one level of the formula:
      • Constants/variables: positions storing a matching value are retrieved via the layer-norm hash \(\phi(z) = \langle z,1,-z,-1\rangle / \sqrt{2z^2+2}\) (see the numerical sketch after this list)
      • Existential quantifier \(\exists i. P\): token \(v\) attends to all \(v'\) agreeing with \(v\) except on variable \(i\); the attention average is \(c/n\), where \(c\) is the number of satisfying completions, and is compared against \(1/(2n)\)
      • Pairwise majority quantifier \(\mathsf{M}^2(i,j).P\): similarly, \(v\) attends to all \(v'\) agreeing on all variables except \(i,j\), giving \(c/n^2\), which is compared against \(1/2\)
    • Design Motivation: Since \(\mathsf{FO}+\mathsf{M}^2\) is equivalent to \(\mathsf{FO}\text{-uniform } \mathsf{TC}^0\), demonstrating that padding can simulate all operators of this logic establishes the lower bound.
  2. Reduction Framework for Transformer Analysis (Lemma 3):

    • Function: Introduces the tools of reductions and complete problems from classical complexity theory into the analysis of Transformers.
    • Mechanism: If a class \(\mathsf{C}\) has a complete problem \(L\) under \(\mathsf{R}\)-reductions, and padded Transformers can recognize \(L\) and compute all \(\mathsf{R}\)-reductions, then \(\mathsf{C} \subseteq \mathsf{AHAT}^d_*\).
    • Design Motivation: This establishes a modular proof strategy — it suffices to show that a Transformer can solve one complete problem and implement the associated reductions to obtain recognizability for the entire class.
  3. Looping for Depth Extension (Lemma 5):

    • Function: Proves that \(O(\log^d n)\) looping combined with polynomial padding suffices to recognize all of \(\mathsf{FO}\text{-uniform } \mathsf{TC}^d\).
    • Mechanism: Graph connectivity is \(\mathsf{NL}\)-complete under \(\mathsf{FO}\) reductions, and log-looped padded Transformers can solve graph connectivity (a known result), so they can decide every \(\mathsf{NL}\) language; since \(\mathsf{L} \subseteq \mathsf{NL}\), padded Transformers can in particular implement \(\mathsf{L}\)-reductions. The completeness of wide-\(\mathsf{TC}^d\) circuit evaluation under \(\mathsf{L}\)-reductions then completes the chain of inclusions.
    • Design Motivation: Formalizes the intuition that padding controls width while looping controls depth.
  4. Uniformity Collapse Theorem (Theorem 3):

    • Function: Proves that for \(d \geq 1\), \(\mathsf{FO}\text{-uniform } \mathsf{TC}^d = \mathsf{L}\text{-uniform } \mathsf{TC}^d\).
    • Mechanism: Since \(\mathsf{L}\) itself lies within \(\mathsf{FO}\text{-uniform } \mathsf{AC}^d\), the \(\mathsf{L}\)-computation used to construct circuits can be simulated by \(\mathsf{FO}\)-uniform circuits and composed with the evaluation circuit.
    • Design Motivation: A new circuit complexity result obtained as a by-product, which strengthens Theorem 2.
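
Returning to the layer-norm hash of Design 1 (the numerical sketch promised after the list): the point of \(\phi(z) = \langle z,1,-z,-1\rangle / \sqrt{2z^2+2}\) is that \(\langle\phi(z), \phi(z')\rangle = (zz'+1)/\sqrt{(z^2+1)(z'^2+1)} \le 1\), with equality exactly when \(z = z'\), so hard attention using these vectors as query and key retrieves precisely the positions storing a matching value. A small numerical check (illustrative only; the stored values are invented):

```python
import numpy as np

def lnorm_hash(z):
    """Layer-norm hash phi(z) = <z, 1, -z, -1> / sqrt(2 z^2 + 2): a unit
    vector whose inner product with phi(z') is maximal iff z = z'."""
    v = np.array([z, 1.0, -z, -1.0])
    return v / np.sqrt(2 * z**2 + 2)

# Hard attention with phi(z) as the query and phi(z') as the keys picks out
# exactly the positions whose stored value equals z.
stored = [0.1, 0.25, 0.5, 0.25, 0.9]
query = lnorm_hash(0.25)
sims = np.array([query @ lnorm_hash(z) for z in stored])
print(np.flatnonzero(sims == sims.max()))  # [1 3]: the positions storing 0.25
```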

Loss & Training

Not applicable — this is a purely theoretical work with no training component.

Key Experimental Results

Main Results

This paper contains no experiments. The central results are mathematical theorems:

| Theorem | Model | Equivalent Complexity Class |
| --- | --- | --- |
| Thm 1 | Fixed depth + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^0\) |
| Thm 2 | \(O(\log^d n)\) loop + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d\) |
| Cor 2.1 | Polylog loop + poly padding | \(\mathsf{FO}\)-uniform \(\mathsf{NC}\) |
| Thm 3 | Circuit complexity (\(d \geq 1\)) | \(\mathsf{FO}\)-uniform \(\mathsf{TC}^d = \mathsf{L}\)-uniform \(\mathsf{TC}^d\) |

Key Comparisons

| Inference-Time Computation | Expressive Power | Parallelism | Parameter Increase |
| --- | --- | --- | --- |
| None | \(\mathsf{TC}^0\) | Fully parallel | None |
| Poly padding | \(\mathsf{TC}^0\) (fills the entire class) | Fully parallel | None |
| Padding + log loop | \(\mathsf{TC}^1 \supseteq \mathsf{NL}\) | Highly parallel | None |
| Padding + polylog loop | \(\mathsf{NC}\) | Highly parallel | None |
| Chain of Thought | \(\supseteq \mathsf{P}\) | Sequential | None |

Key Findings

  • Padding alone does not raise the complexity-class upper bound of Transformers (which remains \(\mathsf{TC}^0\)), but enables them to fill the entire class — previously only a proper subset was known to be reachable.
  • Logarithmic precision and polynomial precision are equivalent in the padded/looped setting.
  • Causal masking and unmasked Transformers are equivalent under padding (Proposition 1).
  • Padding combined with polylog looping reaches \(\mathsf{NC}\), which is the theoretical upper bound for parallelism-preserving computation (unless \(\mathsf{NC} = \mathsf{P}\)).

Highlights & Insights

  • First exact characterization of the expressive power of padded Transformers, resolving an open problem. The key technical insight is to bypass the hard-to-simulate bit predicate by instead working with the equivalent \(\mathsf{FO}+\mathsf{M}^2\) logic.
  • Introducing reduction and completeness tools into Transformer theory is an elegant methodological contribution: one complete problem plus reduction capability yields the entire complexity class. This methodology can be further applied to analyze new Transformer variants.
  • The dual perspective of padding as "computational width" and looping as "computational depth" provides a clear conceptual framework for designing inference-time computation strategies.
  • The uniformity collapse theorem is a novel circuit complexity result with independent value.

Limitations & Future Work

  • The AHAT model (hard attention + masked pre-norm) is an idealization of standard soft-attention Transformers; the expressive power of practical models may differ.
  • While the theory shows that padding is effective, how to train models to "learn to exploit padding" in practice remains an open question (existing empirical results are mixed).
  • The trainability of looping in practice is questionable — appropriate parameterization and training strategies have yet to be resolved.
  • The memory overhead arising from the increased context length induced by padding is not discussed.
  • A fine-grained characterization of \(\mathsf{AHAT}^d_k\) for fixed \(k\) (i.e., with bounded padding) remains open.

Comparison with Prior Work

  • vs. Merrill & Sabharwal (2023): The prior work established the upper bound \(\mathsf{AHAT} \subseteq \mathsf{TC}^0\); this paper completes the picture by establishing the matching lower bound.
  • vs. Pfau (2024): That work posed the open question of padded Transformers vs. \(\mathsf{TC}^0\); this paper provides a complete answer.
  • vs. CoT theory: CoT can reach \(\mathsf{P}\) and beyond (given sufficiently many decoding steps), but sacrifices parallelism; padding combined with looping achieves \(\mathsf{NC}\) while preserving parallelism.
  • vs. Merrill (2025): The result on looped Transformers solving graph connectivity is used as a key lemma in this work.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ Resolves an important open problem; introducing reduction/completeness tools into Transformer theory represents a methodological innovation.
  • Experimental Thoroughness: ⭐⭐⭐ Purely theoretical with no experiments, but the theorem system is complete and self-consistent.
  • Writing Quality: ⭐⭐⭐⭐⭐ The logical chain is clear, theorem statements are precise, and proof intuitions are well articulated.
  • Value: ⭐⭐⭐⭐ Carries significant implications for understanding the computational power of Transformers and for designing inference-time computation strategies.