
Beyond Test-Time Memory: State-Space Optimal Control for LLM Reasoning

Conference: ICML 2026
arXiv: 2603.09221
Code: https://vita-group.github.io/TTC-Net (Project Page)
Area: LLM Reasoning
Keywords: Optimal Control, LQR, Test-Time Planning, State-Space Models, Mathematical Reasoning

TL;DR

LLM reasoning is modeled as an optimal control problem (Linear Quadratic Regulator, LQR) in latent space. The proposed Test-Time Control (TTC) layer performs finite-horizon planning during the forward pass and decodes the optimal control action as the next-token representation. Combined with a symplectic-iteration CUDA solver, it serves as an adapter for pre-trained LLMs, achieving gains of up to +27.8 points on MATH-500 (25.00 to 52.80 over the base model) and a 2-3x Pass@8 improvement on AMC/AIME.

Background & Motivation

Background: Current mainstream sequence models (Transformers, SSMs, linear RNNs) share a core design principle—prediction based on associative memory. Attention retains the full KV cache for retrieval via query matching, while linear RNNs compress historical context into a fixed-size latent state for decoding. Both are essentially System 1-style fast pattern matching.

Limitations of Prior Work: Pure memory paradigms are limited in tasks requiring discovery, reasoning, and problem-solving. While Reinforcement Learning (RL) can make models more goal-oriented, RL serves only as an external training/post-training process and is absent during forward inference. Models learn "what to optimize" but do not learn "how to reason through planning" during the computation process.

Key Challenge: Memory architectures correspond to System 1 thinking, whereas System 2-style deliberation, multi-step planning, and long-range reasoning require specialized architectural support. RL training cannot break the reasoning ceiling imposed by memory architectures; planning capability remains external to the model.

Goal: Directly internalize planning into the model architecture, enabling LLMs to perform goal-oriented reasoning during the forward pass rather than relying on external training procedures.

Key Insight: The authors observe that LQR (Linear Quadratic Regulator) is an analytically solvable subclass of MDPs, and linear dynamical systems have been proven to express a wide family of MDPs. By modeling the next-token prediction of each layer as a differentiable finite-horizon LQR problem, planning can be performed natively during forward inference.

Core Idea: Replace pure memory retrieval with LQR planning from optimal control, allowing the model to "think about future trajectories" before prediction, thus realizing architectural System 2 reasoning.

Method

Overall Architecture

TTC-Net is a hybrid architecture: a TTC layer is inserted every 8 layers, between the Attention and MLP modules of a pre-trained Transformer. Input token features are linearly projected to an initial latent state \(\boldsymbol{h}_0\). The TTC layer constructs a finite-horizon LQR problem on this state and outputs the optimal first-step action \(\boldsymbol{u}_1^*\). After normalization and a linear projection, this action is added back to the residual stream. The entire process is end-to-end differentiable, supporting both training from scratch and fine-tuning of pre-trained models.
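
The data flow described above can be sketched as a residual adapter. This is a minimal NumPy illustration, not the authors' implementation: the names `ttc_adapter`, `rms_norm`, and `plan_first_action` are hypothetical, RMS normalization is an assumption (the summary only says "normalization"), and the planner is a stand-in lambda rather than an actual LQR solve. The output projection is zero-initialized, as the summary later reports for fine-tuning.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Assumed normalization; the source does not say which kind is used.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def ttc_adapter(x, W_in, W_out, plan_first_action):
    """Residual adapter: project, plan, normalize, project, add back."""
    h0 = x @ W_in                    # token features -> initial latent state h_0
    u1 = plan_first_action(h0)       # stand-in for the LQR solve returning u_1^*
    return x + rms_norm(u1) @ W_out  # add the projected action to the residual stream

rng = np.random.default_rng(0)
d_model, d_state = 32, 16            # illustrative sizes, not from the paper
x = rng.normal(size=(4, d_model))    # features for 4 token positions
W_in = rng.normal(size=(d_model, d_state)) / np.sqrt(d_model)
W_out = np.zeros((d_state, d_model)) # zero-initialized output projection

# Placeholder planner (hypothetical): any map from h_0 to an action works here.
y = ttc_adapter(x, W_in, W_out, plan_first_action=lambda h: -0.5 * h)
```

With `W_out` at zero the adapter is an exact identity on `x`, which is why a freshly inserted layer leaves the backbone's outputs unchanged at the start of fine-tuning.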

Key Designs

  1. Test-Time Control (TTC) Layer:

    • Function: Performs finite-horizon optimal control planning during the forward pass, decoding memory states into optimal decisions.
    • Mechanism: Given the initial state \(\boldsymbol{h}_0\) encoding the context, it constructs linear state transitions \(\boldsymbol{h}_t = \boldsymbol{A}_t \boldsymbol{h}_{t-1} + \boldsymbol{B}_t \boldsymbol{u}_t\) and a quadratic cost function \(\sum_{t=1}^{T}(\boldsymbol{h}_t^\top \boldsymbol{Q}_t \boldsymbol{h}_t + \boldsymbol{u}_t^\top \boldsymbol{R}_t \boldsymbol{u}_t)\). The optimal first-step action \(\boldsymbol{u}_1^* = \boldsymbol{K}_1^* \boldsymbol{h}_0\) is obtained via Riccati iteration. The LQR parameters are generated dynamically from the context \(\boldsymbol{h}_0\) by linear layers (contextualization) and modulated by time-dependent coefficients \(\boldsymbol{\Gamma}_\Box^t\), yielding time-heterogeneous dynamics and costs. The layer is fully differentiable: the backward pass solves a dual LQR problem derived from the KKT system.
    • Design Motivation: Existing memory layers (Attention/SSM) can only recall information from past context. The TTC layer optimizes future trajectories and minimizes long-range costs, endowing each sequence block with an intrinsic value function \(V_t(\boldsymbol{h}_t) = -\frac{1}{2}\boldsymbol{h}_t^\top \boldsymbol{P}_t \boldsymbol{h}_t\).
  2. Symplectic Iteration Solver:

    • Function: Replaces sequential matrix inversions in classical Riccati recursion with parallelizable matrix product chains, achieving over 10x throughput improvement.
    • Mechanism: Exploits the symplectic structure of LQR dynamics to reformulate the Riccati recursion as cumulative matrix products of symplectic matrices \(\boldsymbol{\Sigma}_t\). The matrix inversions \(\boldsymbol{A}_t^{-1}\) and \(\boldsymbol{R}_t^{-1}\) at each timestep are independent and fully parallelizable, leaving only matrix multiplications, which fully utilize Tensor Cores. By diagonalizing \(\boldsymbol{A}_t\) and \(\boldsymbol{R}_t\), the number of dense matrix inversions is reduced from \(O(T)\) to \(O(1)\). The solver is further fused into a CUDA kernel with row-level tiling and SRAM streaming, and uses row normalization for numerical stability.
    • Design Motivation: Classical Riccati solvers require sequential backward iteration for \(T\) steps, each involving matrix inversion (\(O(Td^3)\)), which poorly matches GPU accelerators. Symplectic iteration shifts the bottleneck from inversion to multiplication.
  3. Hybrid Architecture and Test-time Scaling:

    • Function: Integrates the TTC layer as a lightweight adapter into pre-trained LLMs and supports flexible adjustment of the planning horizon during inference to enhance performance.
    • Mechanism: One TTC layer is inserted after every 8 Attention layers (8:1 interleaving ratio) using a multi-head structure (head size 16). During training, the planning horizon is sampled from a truncated Poisson log-normal distribution (mean \(T_\mu=8\), max 32) to avoid distribution shifts caused by fixed horizons. During testing, the planning horizon \(T_{test}\) can be arbitrarily increased; the model generalizes beyond the training maximum of 32 to \(T=64\) with sustained performance gains. The output projection \(\boldsymbol{W}_{out}\) is zero-initialized during fine-tuning to ensure the initial model matches the original backbone.
    • Design Motivation: TTC layers require rich memory states as input and must be interleaved with Attention. The mixed-horizon training strategy adapts the model to different planning depths, exposing an architecturally native test-time compute scaling axis.
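
The two solvers described in designs 1 and 2 can be compared on a small random problem. The sketch below is a NumPy illustration under stated assumptions, not the paper's kernel: `riccati_first_gain` is the classical backward recursion for the dynamics and cost given above, and `symplectic_first_gain` uses one textbook-style block form of \(\boldsymbol{\Sigma}_t\) (the paper's exact construction may differ). Function names and test data are hypothetical. It omits the \(\frac{1}{2}\) factor and sign convention of \(V_t\), which do not change the minimizer, and it leaves out row normalization, diagonal parameterization, and parallel scanning.

```python
from functools import reduce
import numpy as np

def riccati_first_gain(A, B, Q, R):
    """Classical backward recursion; returns K_1 with u_1^* = K_1 h_0.
    A, B, Q, R are length-T lists holding A_t, B_t, Q_t, R_t for t = 1..T."""
    d = A[0].shape[0]
    W = np.zeros((d, d))                       # cost-to-go after the last step
    for t in reversed(range(len(A))):
        M = Q[t] + W                           # curvature seen by h_t
        S = R[t] + B[t].T @ M @ B[t]
        K = -np.linalg.solve(S, B[t].T @ M @ A[t])  # feedback gain K_t
        W = A[t].T @ M @ (A[t] + B[t] @ K)     # cost-to-go as a function of h_{t-1}
    return K

def symplectic_first_gain(A, B, Q, R):
    """Same gain from a product chain of 2d x 2d matrices (assumed form)."""
    d = A[0].shape[0]
    I = np.eye(d)
    sigmas = []
    for t in range(1, len(A)):                 # steps 2..T, each independent
        A_inv = np.linalg.inv(A[t])
        G = B[t] @ np.linalg.solve(R[t], B[t].T)    # B R^{-1} B^T
        sigmas.append(np.block([[A_inv @ (I + G @ Q[t]), A_inv @ G],
                                [A[t].T @ Q[t],          A[t].T]]))
    # Matrix products are associative, so this chain could be a parallel scan.
    chain = reduce(np.matmul, sigmas, np.eye(2 * d))
    XY = chain @ np.vstack([I, np.zeros((d, d))])   # terminal cost-to-go is 0
    X, Y = XY[:d], XY[d:]
    W1 = np.linalg.solve(X.T, Y.T).T           # W_1 = Y X^{-1}
    M = Q[0] + W1
    return -np.linalg.solve(R[0] + B[0].T @ M @ B[0], B[0].T @ M @ A[0])

def random_spd(rng, n):
    X = rng.normal(size=(n, n))
    return X @ X.T / n + np.eye(n)             # symmetric positive definite

rng = np.random.default_rng(0)
T, d, m = 6, 4, 3                              # illustrative sizes
A = [np.eye(d) + 0.1 * rng.normal(size=(d, d)) for _ in range(T)]
B = [0.5 * rng.normal(size=(d, m)) for _ in range(T)]
Q = [random_spd(rng, d) for _ in range(T)]
R = [random_spd(rng, m) for _ in range(T)]

K_riccati = riccati_first_gain(A, B, Q, R)
K_symplectic = symplectic_first_gain(A, B, Q, R)
h0 = rng.normal(size=d)
u1 = K_riccati @ h0                            # optimal first-step action
```

In the recursion every step waits on the previous `W`, whereas in the product-chain form the per-step inverses are computed independently and the only sequential dependency is a chain of matrix multiplications.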

Loss & Training

A mixed-horizon training strategy is used: in each iteration, the planning horizon \(T_{train}\) is sampled from a Poisson log-normal distribution with mean \(T_\mu = 8\), log-standard deviation \(T_\sigma = 0.1\), and a cap of 32. When fine-tuning on pre-trained models, the OpenThoughts2-114K dataset plus 800K self-collected reasoning samples are used for SFT, equivalent to imitation learning + inverse reinforcement learning.
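
One way to read the horizon sampling is sketched below. The parameterization is an assumption, since the summary gives only the mean, log-standard deviation, and cap: a Poisson rate is drawn from a log-normal centered at \(\log T_\mu\) with log-standard deviation \(T_\sigma\), a horizon is drawn from that Poisson, and the result is clipped to \([1, 32]\). The function name `sample_horizon` is hypothetical, and the paper may truncate by rejection rather than clipping.

```python
import numpy as np

def sample_horizon(rng, t_mu=8.0, t_sigma=0.1, t_max=32):
    """Draw one training horizon from an assumed Poisson log-normal mixture."""
    # Assumption: log(rate) ~ Normal(log t_mu, t_sigma^2)
    rate = rng.lognormal(mean=np.log(t_mu), sigma=t_sigma)
    # Assumption: truncation is implemented as clipping to [1, t_max]
    return int(np.clip(rng.poisson(rate), 1, t_max))

rng = np.random.default_rng(0)
horizons = [sample_horizon(rng) for _ in range(10_000)]
```

Most draws land near 8 with occasional longer horizons, so the average training cost stays close to that of a fixed horizon of 8 while the model still sees a range of planning depths.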

Key Experimental Results

Main Results — Mathematical Reasoning (Finetuned on Llama-3-Instruct-7B)

| Model | MATH-500 | AMC Acc@8 | AMC Pass@8 | AIME24 Acc@8 | AIME24 Pass@8 | AIME25 Pass@8 |
| --- | --- | --- | --- | --- | --- | --- |
| Base model | 25.00 | 6.63 | 31.32 | 0.00 | 0.00 | 0.00 |
| Full Finetuning | 46.80 | 20.78 | 46.98 | 1.67 | 6.67 | 0.00 |
| + Attention | 47.00 | 20.48 | 44.58 | 0.42 | 3.33 | 6.67 |
| + Mamba | 44.80 | 22.29 | 44.58 | 0.83 | 3.33 | 3.33 |
| + GDN | 47.80 | 17.77 | 37.35 | 0.42 | 3.33 | 6.67 |
| + MesaNet | 47.40 | 12.65 | 27.71 | 1.25 | 10.00 | 0.00 |
| Ours (TTC-Net) | 52.80 | 23.34 | 54.22 | 3.33 | 20.00 | 20.00 |

Ablation Study — MATH-500

| Configuration | \(T_{test}=8\) | \(T_{test}=16\) | Description |
| --- | --- | --- | --- |
| Time-homogeneous parameterization | 48.40 | 45.70 | Removing time modulation; performance drops when increasing horizon |
| Fixed training horizon | 50.60 | 31.50 | Fails to generalize to larger test horizons |
| Uniformly sampled horizon | 50.80 | 51.00 | Similar effect but doubles training cost |
| Attn:TTC = 4:1 | 53.00 | - | More TTC layers improve performance but increase computation |
| Attn:TTC = 16:1 | 47.20 | - | Performance drops with too few TTC layers |
| Ours (PLN + 8:1) | 52.80 | 53.60 | Optimal balance, generalizes up to \(T=64\) |

Highlights & Insights

  • Architectural Paradigm Shift: For the first time, reasoning is redefined from "memory retrieval" to "optimal control," providing an architectural implementation of System 2 cognition for LLM reasoning.
  • New Axis for Test-time Scaling: The planning horizon \(T\) provides a compute scaling axis orthogonal to the number of generated tokens; increasing \(T\) consistently improves reasoning accuracy without retraining.
  • Breaking Reasoning Ceilings: TTC-Net raises AIME Pass@8 from 0% (base model) to 20%, suggesting that control objectives provide an inductive bias that memory layers do not.
  • Symplectic Iteration Solver: Achieves over 10x throughput gain via algorithm-hardware co-design, making optimal control layers practically usable in large-scale LLMs.

Limitations & Future Work

  • Theoretical understanding of the joint dynamical behavior of multi-layer TTC is lacking, and the interaction mechanism between layers is unclear.
  • Currently only validated on 7B models; the effects on larger models and full-stage training (Pre-training + RL) remain unknown.
  • The linear dynamics and quadratic costs of LQR still have expressivity limits; non-linear MDP formulations might provide further improvements.
  • The linear layers used for parameter contextualization are relatively simple; richer world-model parameterizations are worth exploring.
  • Contrast with TTT (Test-Time Training) series: TTT is test-time memory (self-supervised regression), while TTC is test-time decision-making (optimal control).
  • Complementary to RL for LLMs (e.g., DeepSeek-R1): RL provides training-time objectives, while TTC internalizes these objectives into the architecture's forward pass.
  • The design of the symplectic iteration solver can be generalized to other scenarios requiring optimization layers embedded within neural networks.
  • Memory architectures like Titans and DeltaNet can be hybridized with TTC to explore richer memory-planning interactions.

Rating

  • Novelty: 9/10 — Paradigmatic innovation embedding optimal control as an architectural component in LLMs.
  • Experimental Thoroughness: 7/10 — Sufficiently validated on Sudoku and math reasoning, but limited to 7B models and lacks NLP/code tasks.
  • Writing Quality: 9/10 — Coherent narrative from cognitive science to control theory with rigorous mathematical derivation.
  • Value: 8/10 — Opens a new architectural direction for LLM reasoning, though large-scale validation is needed for practical adoption.