PENCIL: Long Thoughts with Short Memory¶
Conference: ICML2025
arXiv: 2503.14337
Code: chr26195/PENCIL
Area: LLM Reasoning
Keywords: Chain-of-Thought, Memory Management, Context Compression, Turing Completeness, Reduction Rule, Space Efficiency
TL;DR¶
Proposed PENCIL (PENCIL ENables Context-efficient Inference and Learning), which introduces a functional call stack-inspired reduction rule during autoregressive generation to recursively clear completed intermediate reasoning steps, enabling LLMs to solve computationally hard problems requiring exponential context using only polynomial context length.
Background & Motivation¶
- Fundamental bottleneck of CoT: Once the intermediate reasoning steps generated by standard Chain-of-Thought are written into the context, they permanently occupy space (write-only), causing the context length to grow linearly with the number of reasoning steps.
- For fundamentally hard reasoning tasks (such as NP-complete or PSPACE-complete problems), the number of reasoning steps may grow exponentially, causing the CoT context to explode.
- Long contexts also degrade the model's ability to retrieve information (the "lost in the middle" phenomenon).
- Inspiration from computer systems: Turing machines can overwrite tapes to reclaim space, and high-level programming languages feature stack frame deallocation and garbage collection mechanisms—whereas LLMs lack corresponding memory reclamation mechanisms.
- Core Problem: Can LLMs be enabled to automatically discard no-longer-needed intermediate computations during generation to achieve deeper reasoning within a finite context window?
Method¶
Overall Architecture¶
PENCIL = next-token generator \(f\) + reduction rule \(\phi\)
After generating each token, the model immediately attempts to apply the reduction rule, alternating between generation and reduction:
Reduction Rule (Mechanism)¶
Three special tokens are introduced: [CALL], [SEP], and [RETURN], defining the reduction pattern:
- C (Context): The context, which can contain any special tokens.
- T (Thoughts): Intermediate reasoning steps that are discarded upon completion.
- A (Answer): The reasoning result, merged back into the context after reduction.
The matching rule guarantees uniqueness: [RETURN] is the last functional token in the sequence, [SEP] is the one closest to it, and [CALL] is the one closest to [SEP].
Tail Recursion Optimization¶
When \(\mathbf{A} = \texttt{[CALL]}~\mathbf{T'}\), the reduction becomes:
Similar to tail recursion optimization in functional programming, this iteratively simplifies complex problems (\(\mathbf{T}\)) into simpler forms (\(\mathbf{T'}\)).
Space Efficiency Analysis¶
- CoT context length = \(n + k\) (continuously accumulating)
- PENCIL maximum context length = \(\max_i |x^{(i)}|\) (shrunk after each reduction)
- For problems like SAT/QBF: CoT space \(O(2^n)\) \(\rightarrow\) PENCIL space \(O(n)\)
Training Method¶
- Data Preparation: Compute-run algorithms generate scaffolded CoT (complete reasoning chains with special tokens), which are then split at reduction points into a set of short sequences \(\{x^{(1)}, x^{(2)}, \ldots, x^{(r+1)}\}\).
- Loss Function: Loss is computed only on the newly generated part of each short sequence (\(x^{(i)} \backslash x^{(i-0.5)}\)), avoiding redundant calculations on the context retained from the previous reduction turn.
- All short sequences can be packed into a single batch, reusing the KV cache.
Key Experimental Results¶
SAT Problem (3-SAT, clause-to-variable ratio = 4.3)¶
| Variables \(n\) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| Baseline Acc | 66 | 57 | 46 | 51 | 46 | 51 | 49 | 51 |
| CoT Acc | 100 | 100 | 100 | 99 | 84 | 63 | 54 | 50 |
| PENCIL Acc | 100 | 100 | 100 | 99 | 99 | 100 | 100 | 100 |
CoT sharply degrades to random performance when \(n \geq 7\), while PENCIL consistently remains close to perfect.
QBF Problem (PSPACE-complete)¶
| Variables \(n\) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| CoT Acc | 100 | 100 | 97 | 94 | 74 | 72 | 69 | 73 |
| PENCIL Acc | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
PENCIL achieves 100% accuracy on all scales, and the trace rate also remains at 100%.
Einstein's Puzzle¶
| Puzzle Size | CoT Acc | PENCIL Acc |
|---|---|---|
| 3×3 | 99% | 99% |
| 4×4 | 34% | 100% |
| 5×5 | 25% (≈ random) | 97% |
- The 5×5 Einstein puzzle is a task that even GPT-4 struggles to solve.
- PENCIL achieves 97% accuracy using only 25M parameters + 2048 context.
- The maximum sequence length is reduced from 151,192 to 3,335 (45\(\times\) compression).
Space Compression Effect¶
| Task | \(n=10\) CoT Length | \(n=10\) PENCIL Length | Compression Ratio |
|---|---|---|---|
| SAT | 13,804 | 2,507 | 5.5× |
| QBF | 151,661 | 649 | 233× |
Ablation Study¶
- Model Size: PENCIL works stably with \(\geq\) 3.15M parameters (4-layer Transformer), whereas CoT requires a larger model and longer context.
- Convergence Speed: Under the same FLOP budget, PENCIL consistently converges faster than CoT, with the gap widening as the problem scale increases.
- Tail Recursion: For the 5×5 Einstein puzzle, the maximum length without tail recursion is 7,705, which reduces to 3,335 with tail recursion.
Highlights & Insights¶
- Brilliant Analogy: Analogizes LLM inference memory management to a function call stack, where
[CALL]/[SEP]/[RETURN]directly correspond to function calling conventions. - Theoretical Breakthrough (Theorem 5.1): Proves that PENCIL can simulate any Turing machine with \(O(T)\) generated tokens and \(O(S)\) maximum context length—marking the first work to prove that Transformers can achieve general, space-efficient computation.
- Far-reaching Corollary: Under polynomial context, PENCIL can solve all PSPACE problems, whereas standard CoT can only solve P-class problems.
- Small Model, Big Capability: A 25M parameter model solves the Einstein puzzle where GPT-4 fails, demonstrating the leverage effect of this methodology.
- Unified Training and Inference: The reduction rule is naturally consistent during both training and inference, requiring no extra decoding tricks.
Limitations & Future Work¶
- Reliance on Algorithmic Priors: The training data requires pre-designed reasoning paths with reduction markers (scaffolded CoT), meaning the model cannot autonomously discover reduction strategies.
- Validation Limited to Synthetic/Formal Tasks: SAT, QBF, and Einstein puzzle are highly structured tasks; the generalizability to natural language reasoning remains to be verified.
- Deterministic Inference: Currently, PENCIL uses deterministic generation (greedy decoding), and its performance under random sampling or beam search has not been explored.
- Special Token Overhead: Introducing three special tokens adds overhead to the vocabulary, and the model must precisely learn when to generate these tokens.
- General LLM Adaptation: The fine-tuning effects on pre-trained large language models (such as LLaMA or GPT series) have not been verified.
Related Work & Insights¶
- CoT Theory: Merrill & Sabharwal (2023) proved that CoT can theoretically perform general computation, but with a space efficiency of \(O(T)\). This work improves it to \(O(S)\).
- External Memory: Gao et al. (2023) and Wang et al. (2024) augment LLMs with external memory, but lack systematic space reclamation mechanisms.
- Functional Programming: The reduction rule directly originates from the reduction semantics of \(\lambda\)-calculus, establishing a deep connection between LLM reasoning and programming language theory.
- Analogy to Normalizing Flows: The alternating generation-reduction process of PENCIL resembles the forward-inverse transformations in flow models.
- Inspirational Directions: Extending the reduction rules to more flexible pattern matching, and enabling models to learn reduction strategies autonomously via reinforcement learning.
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ — First to introduce memory management based on function call stacks into LLM reasoning with rigorous theoretical guarantees.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Three task categories plus ablation studies and theoretical validation, though still limited to formal problems.
- Writing Quality: ⭐⭐⭐⭐⭐ — Clear concepts, elegant illustrations, and a smooth flow between theory and experiments.
- Value: ⭐⭐⭐⭐⭐ — Fundamentally shifts the paradigm of "LLM reasoning = infinite context" and opens up the research direction of space-efficient reasoning.