Skip to content

PENCIL: Long Thoughts with Short Memory

Conference: ICML2025
arXiv: 2503.14337
Code: chr26195/PENCIL
Area: LLM Reasoning
Keywords: Chain-of-Thought, Memory Management, Context Compression, Turing Completeness, Reduction Rule, Space Efficiency

TL;DR

Proposed PENCIL (PENCIL ENables Context-efficient Inference and Learning), which introduces a functional call stack-inspired reduction rule during autoregressive generation to recursively clear completed intermediate reasoning steps, enabling LLMs to solve computationally hard problems requiring exponential context using only polynomial context length.

Background & Motivation

  • Fundamental bottleneck of CoT: Once the intermediate reasoning steps generated by standard Chain-of-Thought are written into the context, they permanently occupy space (write-only), causing the context length to grow linearly with the number of reasoning steps.
  • For fundamentally hard reasoning tasks (such as NP-complete or PSPACE-complete problems), the number of reasoning steps may grow exponentially, causing the CoT context to explode.
  • Long contexts also degrade the model's ability to retrieve information (the "lost in the middle" phenomenon).
  • Inspiration from computer systems: Turing machines can overwrite tapes to reclaim space, and high-level programming languages feature stack frame deallocation and garbage collection mechanisms—whereas LLMs lack corresponding memory reclamation mechanisms.
  • Core Problem: Can LLMs be enabled to automatically discard no-longer-needed intermediate computations during generation to achieve deeper reasoning within a finite context window?

Method

Overall Architecture

PENCIL = next-token generator \(f\) + reduction rule \(\phi\)

After generating each token, the model immediately attempts to apply the reduction rule, alternating between generation and reduction:

\[\text{PENCIL}_{\phi,f}^k = (\phi \circ f)^k\]

Reduction Rule (Mechanism)

Three special tokens are introduced: [CALL], [SEP], and [RETURN], defining the reduction pattern:

\[\mathbf{C}~\texttt{[CALL]}~\mathbf{T}~\texttt{[SEP]}~\mathbf{A}~\texttt{[RETURN]} \Rightarrow \mathbf{C}~\mathbf{A}\]
  • C (Context): The context, which can contain any special tokens.
  • T (Thoughts): Intermediate reasoning steps that are discarded upon completion.
  • A (Answer): The reasoning result, merged back into the context after reduction.

The matching rule guarantees uniqueness: [RETURN] is the last functional token in the sequence, [SEP] is the one closest to it, and [CALL] is the one closest to [SEP].

Tail Recursion Optimization

When \(\mathbf{A} = \texttt{[CALL]}~\mathbf{T'}\), the reduction becomes:

\[\mathbf{C}~\texttt{[CALL]}~\mathbf{T}~\texttt{[SEP]}~\texttt{[CALL]}~\mathbf{T'}~\texttt{[RETURN]} \Rightarrow \mathbf{C}~\texttt{[CALL]}~\mathbf{T'}\]

Similar to tail recursion optimization in functional programming, this iteratively simplifies complex problems (\(\mathbf{T}\)) into simpler forms (\(\mathbf{T'}\)).

Space Efficiency Analysis

  • CoT context length = \(n + k\) (continuously accumulating)
  • PENCIL maximum context length = \(\max_i |x^{(i)}|\) (shrunk after each reduction)
  • For problems like SAT/QBF: CoT space \(O(2^n)\) \(\rightarrow\) PENCIL space \(O(n)\)

Training Method

  • Data Preparation: Compute-run algorithms generate scaffolded CoT (complete reasoning chains with special tokens), which are then split at reduction points into a set of short sequences \(\{x^{(1)}, x^{(2)}, \ldots, x^{(r+1)}\}\).
  • Loss Function: Loss is computed only on the newly generated part of each short sequence (\(x^{(i)} \backslash x^{(i-0.5)}\)), avoiding redundant calculations on the context retained from the previous reduction turn.
  • All short sequences can be packed into a single batch, reusing the KV cache.

Key Experimental Results

SAT Problem (3-SAT, clause-to-variable ratio = 4.3)

Variables \(n\) 3 4 5 6 7 8 9 10
Baseline Acc 66 57 46 51 46 51 49 51
CoT Acc 100 100 100 99 84 63 54 50
PENCIL Acc 100 100 100 99 99 100 100 100

CoT sharply degrades to random performance when \(n \geq 7\), while PENCIL consistently remains close to perfect.

QBF Problem (PSPACE-complete)

Variables \(n\) 3 4 5 6 7 8 9 10
CoT Acc 100 100 97 94 74 72 69 73
PENCIL Acc 100 100 100 100 100 100 100 100

PENCIL achieves 100% accuracy on all scales, and the trace rate also remains at 100%.

Einstein's Puzzle

Puzzle Size CoT Acc PENCIL Acc
3×3 99% 99%
4×4 34% 100%
5×5 25% (≈ random) 97%
  • The 5×5 Einstein puzzle is a task that even GPT-4 struggles to solve.
  • PENCIL achieves 97% accuracy using only 25M parameters + 2048 context.
  • The maximum sequence length is reduced from 151,192 to 3,335 (45\(\times\) compression).

Space Compression Effect

Task \(n=10\) CoT Length \(n=10\) PENCIL Length Compression Ratio
SAT 13,804 2,507 5.5×
QBF 151,661 649 233×

Ablation Study

  • Model Size: PENCIL works stably with \(\geq\) 3.15M parameters (4-layer Transformer), whereas CoT requires a larger model and longer context.
  • Convergence Speed: Under the same FLOP budget, PENCIL consistently converges faster than CoT, with the gap widening as the problem scale increases.
  • Tail Recursion: For the 5×5 Einstein puzzle, the maximum length without tail recursion is 7,705, which reduces to 3,335 with tail recursion.

Highlights & Insights

  1. Brilliant Analogy: Analogizes LLM inference memory management to a function call stack, where [CALL]/[SEP]/[RETURN] directly correspond to function calling conventions.
  2. Theoretical Breakthrough (Theorem 5.1): Proves that PENCIL can simulate any Turing machine with \(O(T)\) generated tokens and \(O(S)\) maximum context length—marking the first work to prove that Transformers can achieve general, space-efficient computation.
  3. Far-reaching Corollary: Under polynomial context, PENCIL can solve all PSPACE problems, whereas standard CoT can only solve P-class problems.
  4. Small Model, Big Capability: A 25M parameter model solves the Einstein puzzle where GPT-4 fails, demonstrating the leverage effect of this methodology.
  5. Unified Training and Inference: The reduction rule is naturally consistent during both training and inference, requiring no extra decoding tricks.

Limitations & Future Work

  1. Reliance on Algorithmic Priors: The training data requires pre-designed reasoning paths with reduction markers (scaffolded CoT), meaning the model cannot autonomously discover reduction strategies.
  2. Validation Limited to Synthetic/Formal Tasks: SAT, QBF, and Einstein puzzle are highly structured tasks; the generalizability to natural language reasoning remains to be verified.
  3. Deterministic Inference: Currently, PENCIL uses deterministic generation (greedy decoding), and its performance under random sampling or beam search has not been explored.
  4. Special Token Overhead: Introducing three special tokens adds overhead to the vocabulary, and the model must precisely learn when to generate these tokens.
  5. General LLM Adaptation: The fine-tuning effects on pre-trained large language models (such as LLaMA or GPT series) have not been verified.
  • CoT Theory: Merrill & Sabharwal (2023) proved that CoT can theoretically perform general computation, but with a space efficiency of \(O(T)\). This work improves it to \(O(S)\).
  • External Memory: Gao et al. (2023) and Wang et al. (2024) augment LLMs with external memory, but lack systematic space reclamation mechanisms.
  • Functional Programming: The reduction rule directly originates from the reduction semantics of \(\lambda\)-calculus, establishing a deep connection between LLM reasoning and programming language theory.
  • Analogy to Normalizing Flows: The alternating generation-reduction process of PENCIL resembles the forward-inverse transformations in flow models.
  • Inspirational Directions: Extending the reduction rules to more flexible pattern matching, and enabling models to learn reduction strategies autonomously via reinforcement learning.

Rating

  • Novelty: ⭐⭐⭐⭐⭐ — First to introduce memory management based on function call stacks into LLM reasoning with rigorous theoretical guarantees.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three task categories plus ablation studies and theoretical validation, though still limited to formal problems.
  • Writing Quality: ⭐⭐⭐⭐⭐ — Clear concepts, elegant illustrations, and a smooth flow between theory and experiments.
  • Value: ⭐⭐⭐⭐⭐ — Fundamentally shifts the paradigm of "LLM reasoning = infinite context" and opens up the research direction of space-efficient reasoning.