AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

Conference: NeurIPS 2025 | arXiv: 2507.08567 | Code: None | Area: LLM Inference | Keywords: recurrent Transformer, iterative encoder, test-time scaling, fixed point, upward generalization

TL;DR

This paper proposes AbbIE, an architecture that recursively iterates the intermediate layers (Body) of a decoder-only Transformer. Trained with only 2 iterations, AbbIE achieves upward generalization at inference time by increasing the number of iterations, surpassing standard Transformers on both language modeling perplexity and zero-shot ICL benchmarks, while serving as a drop-in replacement for standard Transformers.

Background & Motivation

Background: Transformer performance has traditionally been improved by scaling model parameters and training data (scaling laws). Test-time scaling has recently emerged as a new direction, but existing recurrent Transformers (e.g., Geiping et al. 2025) require training with many iterations and are typically limited to specific tasks.

Limitations of Prior Work: (1) GPU memory growth lags behind compute growth, constraining model scale expansion; (2) existing recurrent Transformers incur high training costs (requiring many iterations) and cannot serve as general-purpose replacements for standard Transformers; (3) most recurrent methods fail to generalize beyond the number of training iterations at inference time (upward generalization failure).

Key Challenge: How can Transformers be endowed with test-time compute scaling capability without substantially increasing training cost?

Goal: Design a recurrent Transformer such that: (a) it is equivalent to a standard Transformer at a single iteration; (b) it requires only 2 training iterations; (c) it can scale to an arbitrary number of inference iterations with continuously improving performance.

Key Insight: The authors observe that the residual stream of a Transformer naturally injects original input information into every layer, which may be sufficient to achieve Path Independence (convergence to a fixed point), enabling recursive iteration without additional projection matrices.

Core Idea: The Transformer is partitioned into Head–Body–Tail segments; only the Body is iteratively applied. An inter-iteration residual connection ensures convergence, and only 2 training iterations are needed to achieve upward generalization at inference time.

Method

Overall Architecture

Input tokens are embedded and passed through the Head (\(N_h\) Transformer blocks) into concept space. The Body (\(N_b\) blocks) is then applied recursively for \(r\) iterations, after which the Tail (\(N_t\) blocks) maps the representations back to token space for unembedding. At \(r=1\), AbbIE is exactly equivalent to a standard Transformer.
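
This data flow can be summarized in a minimal sketch (hypothetical PyTorch-style code, not the authors' released implementation; `embed`, `head`, `body`, `tail`, and `unembed` are assumed stand-ins for the embedding layers and the \(N_h\), \(N_b\), and \(N_t\) block groups):

```python
import torch.nn as nn

class AbbIESketch(nn.Module):
    """Hypothetical Head-Body-Tail skeleton; only the Body is applied r times."""

    def __init__(self, embed, head, body, tail, unembed):
        super().__init__()
        self.embed, self.head = embed, head        # tokens -> concept space (N_h blocks)
        self.body = body                           # iterated segment (N_b blocks)
        self.tail, self.unembed = tail, unembed    # concept space -> logits (N_t blocks)

    def forward(self, tokens, r: int = 1):
        h = self.head(self.embed(tokens))
        for _ in range(r):        # r = 1 recovers a standard single forward pass
            h = self.body(h)      # per-iteration update; variants discussed under Key Designs
        return self.unembed(self.tail(h))
```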

Key Designs

  1. Head–Body–Tail Partition:

    • Function: The Transformer layers are divided into three groups — the Head handles tokenization to concept space, the Body performs iterative reasoning, and the Tail maps from concept space back to token space.
    • Mechanism: The Head and Tail are each applied once; the Body is applied \(r\) times. This partition is motivated by the concept space theory of Kaplan et al. 2024, which posits that the early layers perform de-tokenization, the late layers perform re-tokenization, and the intermediate layers operate in concept space.
    • Design Motivation: Applying recursion to the entire model would corrupt the tokenization process; iteration should only occur at the appropriate level of abstraction.
  2. AbbIE-D (Diffusion-inspired variant):

    • Function: Adds an inter-iteration residual connection between consecutive Body iterations.
    • Mechanism: \(h_{k+1} = B(h_k) + h_k\), where \(B(\cdot)\) denotes the Body component. In contrast to AbbIE-C, which uses \(h_{k+1} = B(h_k)\) (relying solely on intra-Body residuals), AbbIE-D increases the relative contribution of the original input signal, preventing the \(h_0\) signal from being diluted across iterations.
    • Design Motivation: Achieving Path Independence (fixed-point convergence) requires a sufficiently strong original input signal at each iteration. Experiments confirm that AbbIE-C diverges while AbbIE-D converges (see the sketch after this list).
  3. Only 2 Training Iterations:

    • Function: Training uses only \(r=2\) (Body applied twice), yet inference can employ \(r=4, 8\), or more iterations.
    • Mechanism: Because AbbIE-D satisfies the fixed-point property, 2 training iterations are sufficient for the model to learn how to exploit additional iterations for improved representations. Larger models (350M) achieve the lowest perplexity at \(r=4\), demonstrating successful upward generalization.
    • Design Motivation: Reduces training cost to near that of a standard Transformer while retaining test-time scaling capability.
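
The two update rules, and the fixed-point behavior that separates them, can be sketched as follows (hypothetical code; `body` is an assumed module standing in for the Body stack, and the norm-tracking loop only illustrates the Path Independence argument, not the paper's exact diagnostic):

```python
import torch

def iterate_body(body, h0, r, variant="D"):
    """Apply the Body r times with either the AbbIE-C or AbbIE-D update rule."""
    h = h0
    for _ in range(r):
        if variant == "D":
            h = body(h) + h    # inter-iteration residual keeps the original signal strong
        else:                  # variant "C"
            h = body(h)        # relies only on the residual stream inside the Body blocks
    return h

@torch.no_grad()
def residual_norms(body, h0, r):
    """Track ||h_{k+1} - h_k|| across AbbIE-D iterations to probe fixed-point behavior."""
    h, norms = h0, []
    for _ in range(r):
        h_next = body(h) + h
        norms.append((h_next - h).norm().item())
        h = h_next
    return norms  # shrinking norms suggest convergence toward a fixed point
```

Training with \(r=2\) and then calling `iterate_body` with \(r=4\) or \(r=8\) at inference is the upward-generalization setting evaluated below.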

Loss & Training

Standard next-token prediction (NLL). AdamW optimizer (\(\beta_1=0.9\), \(\beta_2=0.95\)) with a Warmup-Stable-Decay learning rate schedule. Training token budget is 20 tokens per parameter (compute-optimal). All models use tied embeddings.
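
For concreteness, a sketch of the optimizer and a Warmup-Stable-Decay schedule (hypothetical; the betas are those stated above, but the peak learning rate, warmup/decay fractions, and step count are illustrative assumptions, and `nn.Linear` merely stands in for the actual model):

```python
import torch
import torch.nn as nn

def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to zero.
    The fractions and peak_lr are illustrative assumptions, not values from the paper."""
    warmup = max(int(warmup_frac * total_steps), 1)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warmup:
        return peak_lr * step / warmup
    if step < decay_start:
        return peak_lr
    return peak_lr * max(0.0, (total_steps - step) / max(total_steps - decay_start, 1))

# Stand-in module; in practice this would be the AbbIE model with tied embeddings.
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.0, betas=(0.9, 0.95))
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: wsd_lr(step, total_steps=100_000, peak_lr=3e-4)
)
```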

Key Experimental Results

Main Results

| Benchmark | Metric | AbbIE-D (r=8) | AbbIE-D (r=2) | Std | 0pt (r=2) |
|---|---|---|---|---|---|
| HellaSwag (350M) | Acc | 36.6 | 33.8 | 30.1 | 29.7 |
| LAMBADA (350M) | Acc | 30.8 | 29.8 | 24.2 | 22.2 |
| ARC-Easy (350M) | Acc | 53.2 | 48.9 | 45.6 | 46.3 |
| CommonsenseQA (350M) | Acc | 23.7 | 20.0 | 20.0 | 20.0 |

Note: On CommonsenseQA, both the standard Transformer and 0pt remain at the random baseline (20%), while AbbIE-D at \(r=8\) exceeds the random baseline, suggesting that iterative computation unlocks emergent reasoning capability.

Ablation Study

| Configuration | Fixed-Point Convergence? | Upward Generalization? | Notes |
|---|---|---|---|
| AbbIE-D | Converges | Yes (350M at r=4) | Inter-iteration residual ensures convergence |
| AbbIE-C | Diverges | No | Intra-Body residuals alone are insufficient |
| 0pt (Geiping et al.) | Converges | No (perplexity collapses at r≠2) | Converges but fails to generalize beyond training iterations |

Key Findings

  • AbbIE-D is the only general-purpose recurrent Transformer that achieves upward generalization with only 2 training iterations, with ICL performance continuing to improve up to 4× the training iteration count.
  • Perplexity is approximately 5% lower than that of standard Transformers, following the same scaling law.
  • FLOP efficiency improves over the course of training: although AbbIE-D incurs slightly higher training FLOPs, the gap narrows in longer training runs.
  • Key finding: Even when perplexity slightly increases at \(r=8\), ICL task performance continues to improve — indicating that the relationship between perplexity and downstream task performance is not strictly monotonic.

Highlights & Insights

  • Equivalence to a standard Transformer at \(r=1\) is a highly desirable engineering property: models can first be trained in the standard manner and iterative inference can be enabled on demand, with minimal adoption risk.
  • The 2-iteration training design elegantly balances training cost against inference capability. The contrast with 0pt — which requires many training iterations yet still fails to generalize — demonstrates that the key factor is architectural design (inter-iteration residual) rather than the training recipe.
  • The concept space theoretical framework provides a principled justification for the Head–Body–Tail partition and points toward future work on adaptive selection of partition boundaries.

Limitations & Future Work

  • Validation is limited to 350M models: whether the approach remains effective at 1B+ scale is unclear. The authors note that upward generalization is weaker for 200M models than for 350M, suggesting a critical model size threshold.
  • Inference latency scales linearly: \(r\) iterations imply an \(r\)-fold increase in inference latency (even though parameter count does not increase), which is unfavorable for latency-sensitive applications.
  • ICL gains are moderate: the largest improvement is approximately 12% (HellaSwag), with absolute performance remaining on par with standard models of equivalent scale.
  • Generation tasks are not evaluated: all evaluations are zero-shot ICL; generation tasks such as translation and summarization are not assessed.
  • vs. Coconut (latent reasoning): Both perform iteration in latent space, but Coconut requires specialized datasets and training pipelines, whereas AbbIE does not.
  • vs. 0pt (Geiping et al. 2025): Both are recurrent Transformers, but 0pt employs input concatenation with projection, while AbbIE-D uses residual connections. AbbIE incurs lower training cost (2 vs. multiple iterations) and achieves upward generalization.
  • vs. MoE: MoE reduces inference cost via sparse activation; AbbIE reduces memory cost via parameter sharing. The two approaches are complementary.

Rating

  • Novelty: ⭐⭐⭐⭐ The Head–Body–Tail partition combined with inter-iteration residuals is concise and effective, though the recurrent Transformer direction has accumulated considerable prior work.
  • Experimental Thoroughness: ⭐⭐⭐ Evaluation is limited to 350M models; generation task assessment is absent.
  • Writing Quality: ⭐⭐⭐⭐ Logical structure is clear; theoretical (Path Independence) and empirical contributions are well integrated.
  • Value: ⭐⭐⭐⭐ Proposes a practical recurrent Transformer alternative; the drop-in property is particularly attractive.