Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Conference: ICLR 2026 | arXiv: 2602.16839 | Code: Not open-sourced | Area: LLM Reasoning | Keywords: Large Reasoning Models, RL Training Efficiency, KV Cache Compression, Parameter-Efficient Fine-Tuning, Progressive Thought Encoding

TL;DR

This paper proposes Progressive Thought Encoding, which encodes evicted token information into fixed-size LoRA weight updates whenever KV cache entries are evicted, enabling efficient RL training of large reasoning models under constrained cache budgets while preserving long-range reasoning capability.

Background & Motivation

  • Large Reasoning Models (LRMs) require long rollouts during RL training to obtain outcome rewards, with autoregressive decoding dominating both time and memory overhead.
  • Sliding window cache strategies can bound memory usage, but discarding intermediate reasoning tokens disrupts long-range context comprehension and degrades reasoning quality.
  • Experiments confirm that applying sliding window caching during RL training of Qwen models yields noticeably lower performance than full-cache training.
  • Key Challenge: How to allow reasoning models to effectively "attend" to all preceding tokens under a fixed cache capacity.

Core Problem

How to train LRMs efficiently under a strict memory budget without sacrificing reasoning accuracy?

Method

Cache-aware GRPO Objective

The standard GRPO objective is reformulated into a cache-aware variant. At each step \(t\), a cache policy \(D\) selects a truncated context:

\[\pi_\theta^D(y|p) = \prod_{t=1}^T \pi_\theta(y_t | \mathcal{C}_t^D)\]

The objective becomes:

\[\mathcal{L}_{\text{GRPO}}^D(\theta_g; \theta_{\text{ref}}) = \mathbb{E}_{y \sim \pi_{\theta_g}^D(\cdot|p)}\left[r(y) - \beta\, \text{KL}\!\left(\pi_{\theta_g}^D(\cdot|p) \,\|\, \pi_{\theta_{\text{ref}}}(\cdot|p)\right)\right]\]
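A minimal sketch of how this objective is typically estimated for one group of rollouts (all names, shapes, and the KL estimator here are illustrative assumptions, not taken from the paper): rewards are normalized within the rollout group, and a KL penalty against the reference policy is subtracted.

```python
import torch

def grpo_loss(logp_policy: torch.Tensor, logp_ref: torch.Tensor,
              rewards: torch.Tensor, beta: float = 0.04) -> torch.Tensor:
    """Cache-aware GRPO surrogate for one group of rollouts.

    logp_policy: (G, T) token log-probs under the truncated-cache policy pi_theta^D
    logp_ref:    (G, T) token log-probs under the reference policy
    rewards:     (G,)   outcome rewards, one per rollout
    """
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    # Simple per-sequence KL estimate between cached and reference policies.
    kl = (logp_policy - logp_ref).mean(dim=-1)
    # Maximize advantage-weighted likelihood while penalizing KL drift.
    return -(adv * logp_policy.mean(dim=-1) - beta * kl).mean()
```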

Progressive Thought Encoding

Core Idea: Rather than discarding evicted tokens, the method learns from them and updates lightweight parameters accordingly.

1. Context State Computation

When the cache is full, the keys and values of evicted tokens are \(K_e, V_e\). A learnable global query \(q_g\) is used to compute the context state:

\[S_e = \left((W_Q^a q_g)(W_K^a K_e)^T\right)(W_V^a V_e)\]

where \(W_Q^a, W_K^a, W_V^a\) project the global query and evicted key-value pairs into a compressed latent space.
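A minimal sketch of this computation (module names, shapes, and the initialization are my assumptions; the formula shows a plain bilinear product, so no softmax is applied here):

```python
import torch
import torch.nn as nn

class ContextStateEncoder(nn.Module):
    """Compresses evicted KV pairs into a fixed-size context state S_e.

    Shapes are illustrative: d_model is the hidden size, g the number of
    global query tokens, d_lat the compressed latent width.
    """
    def __init__(self, d_model: int, d_lat: int, g: int = 32):
        super().__init__()
        self.q_g = nn.Parameter(torch.randn(g, d_model) * 0.02)  # learnable global queries
        self.W_Q = nn.Linear(d_model, d_lat, bias=False)
        self.W_K = nn.Linear(d_model, d_lat, bias=False)
        self.W_V = nn.Linear(d_model, d_lat, bias=False)

    def forward(self, K_e: torch.Tensor, V_e: torch.Tensor) -> torch.Tensor:
        # K_e, V_e: (n_evicted, d_model) keys/values of the evicted tokens.
        scores = self.W_Q(self.q_g) @ self.W_K(K_e).T   # (g, n_evicted)
        return scores @ self.W_V(V_e)                   # (g, d_lat)
```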

2. LoRA Weight Update

The context state is converted into a LoRA weight increment:

\[\Delta W = A \cdot S_e \cdot B\]

Mapping matrices \(A\) and \(B\) inject the compressed contextual information into the model weights.
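As a shape check (all sizes are illustrative assumptions), the increment is low-rank by construction:

```python
import torch

d_out, d_in, d_lat, g = 4096, 4096, 64, 32   # illustrative sizes
S_e = torch.randn(g, d_lat)                  # context state from step 1

A = torch.randn(d_out, g) * 0.02             # maps g global slots -> output dim
B = torch.randn(d_lat, d_in) * 0.02          # maps latent width -> input dim
delta_W = A @ S_e @ B                        # (d_out, d_in), rank <= min(g, d_lat)
```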

3. Progressive Update

  • Decoding continues under the updated policy \(\pi_{\theta'}^D\), where \(\theta' = \theta + \Delta W\).
  • Each time the cache fills again, a new \(S_e'\) is computed and accumulated: \(S_e \leftarrow \text{Normalize}(S_e + S_e')\) (sketched after this list).
  • A learnable global token \(h_g\) serves as the carrier for \(q_g\) at initialization.
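A one-function sketch of the accumulation step; the paper does not specify the normalization, so L2 row normalization is assumed here:

```python
import torch
import torch.nn.functional as F

def accumulate_state(S_e: torch.Tensor, S_e_new: torch.Tensor) -> torch.Tensor:
    """Fold a freshly computed context state into the running one.

    Implements S_e <- Normalize(S_e + S_e_new); the choice of L2 row
    normalization is an assumption, not taken from the paper.
    """
    return F.normalize(S_e + S_e_new, dim=-1)
```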

4. Cache Eviction Strategy

  • Question tokens are always retained in the cache (analogous to a sink-token mechanism).
  • Sliding-window eviction is applied only to reasoning tokens.
  • When the cache is saturated, 25% of the tokens are evicted (a sketch follows this list).
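A sketch of this eviction policy; that the *oldest* reasoning tokens are the ones dropped follows from the sliding-window framing, and the data layout is my assumption:

```python
def evict(cache: list, n_question: int, frac: float = 0.25):
    """Sliding-window eviction that never touches question tokens.

    cache: per-token KV entries; the first n_question entries hold the
    question and act like sink tokens, so only reasoning tokens are evicted.
    """
    reasoning = cache[n_question:]
    n_evict = max(1, int(len(reasoning) * frac))
    evicted = reasoning[:n_evict]                     # oldest reasoning tokens
    kept = cache[:n_question] + reasoning[n_evict:]   # question + recent window
    return kept, evicted
```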

Training and Inference

  • Training: The adapter is updated online as evicted tokens are continuously encoded during rollout (a rollout-loop sketch follows this list).
  • Inference: The learned adapter enables the model to maintain reasoning capability under a constrained cache.
  • Backpropagation through the full-cache rollout is not required.
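Putting the pieces together, a rollout under the cache budget might look like the loop below. This is a sketch, not the authors' implementation: `model.step`, `model.apply_delta`, and `stack_kv` are hypothetical hooks standing in for a decode step, an in-place LoRA-increment update, and KV batching.

```python
def rollout(model, encoder, A, B, prompt_ids, cache_budget: int, max_new: int):
    """One constrained-cache rollout with progressive thought encoding.

    Reuses the evict/accumulate_state helpers sketched above; the adapter
    pathway is updated online, with no backprop through the full rollout.
    """
    cache, S_e, y = [], None, list(prompt_ids)
    for _ in range(max_new):
        if len(cache) >= cache_budget:
            cache, evicted = evict(cache, n_question=len(prompt_ids))
            K_e, V_e = stack_kv(evicted)               # hypothetical KV batching
            S_new = encoder(K_e, V_e)                  # encode what was dropped
            S_e = S_new if S_e is None else accumulate_state(S_e, S_new)
            model.apply_delta(A @ S_e @ B)             # theta' = theta + Delta W
        tok, kv = model.step(y, cache)                 # decode under pi_theta'^D
        cache.append(kv)
        y.append(tok)
    return y
```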

Key Experimental Results

Main Results: Method Comparison (Maximum Generation Length 3072)

| Model | Method | Peak GPU Mem | Math500 | Olympiad | AMC | AIME24 | AIME25 | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | Baseline | – | 50.8 | 27.2 | 34.3 | 20.0 | 13.3 | 26.9 |
| Qwen2.5-3B | LoRA | 82.8% | 53.2 | 27.8 | 35.9 | 20.0 | 16.7 | 28.2 |
| Qwen2.5-3B | LoRA_c | 38.0% | 50.0 | 27.7 | 33.1 | 16.7 | 10.0 | 25.6 |
| Qwen2.5-3B | Ours | 45.3% | 54.0 | 29.0 | 45.0 | 20.0 | 16.7 | 30.1 |
| Qwen2.5-7B | Baseline | – | 56.8 | 34.7 | 48.4 | 23.3 | 16.6 | 33.1 |
| Qwen2.5-7B | LoRA | 85.8% | 59.4 | 38.7 | 50.6 | 30.0 | 26.7 | 38.1 |
| Qwen2.5-7B | LoRA_c | 63.1% | 61.2 | 35.9 | 52.5 | 20.0 | 26.7 | 36.7 |
| Qwen2.5-7B | Ours | 67.2% | 61.2 | 38.7 | 52.5 | 30.0 | 30.0 | 39.6 |
| DS-R1-8B | Baseline | – | 53.6 | 28.7 | 42.5 | 20.0 | 20.0 | 30.1 |
| DS-R1-8B | LoRA | 88.7% | 57.4 | 35.3 | 55.0 | 23.3 | 20.0 | 34.9 |
| DS-R1-8B | LoRA_c | 59.1% | 54.2 | 31.9 | 45.0 | 36.7 | 26.7 | 35.1 |
| DS-R1-8B | Ours | 59.8% | 57.6 | 39.7 | 60.0 | 56.7 | 43.3 | 45.6 |

Key Numbers

  • Average accuracy on DS-R1-8B improves by +15.5 points over the baseline (30.1 → 45.6); AIME24 gains +36.7 points and AIME25 +23.3 points.
  • Peak GPU memory on DS-R1-8B drops from 88.7% (full-cache LoRA) to 59.8% (nearly 30 percentage points).
  • Computation drops from 7.4 to 4.6 TFLOPs (a 38% reduction).
  • Average accuracy gains +10.5 points over naive cache truncation (LoRA_c) on DS-R1-8B.

Highlights & Insights

  1. Turning cache eviction into a learning opportunity: Rather than discarding tokens outright, the method extracts and encodes their information as model weight updates.
  2. Training–inference consistency: Both training and inference operate under constrained cache, eliminating the train–inference mismatch problem.
  3. Substantial efficiency gains: ~50% memory reduction and ~38% compute reduction, with reasoning accuracy improving rather than degrading.
  4. Remarkable gains on DS-R1-8B: Average accuracy improves by 15.5 percentage points, far surpassing full-cache LoRA.
  5. Seamless integration with GRPO: The method can be directly embedded into existing RL training frameworks.

Limitations & Future Work

  • Validation is limited to mathematical reasoning; effectiveness on other reasoning tasks (code, science) remains unexplored.
  • The sliding window eviction strategy is relatively simple; importance-based eviction may be more effective but incurs higher computational overhead.
  • Sensitivity analysis of the global query \(q_g\) dimensionality and the number of global tokens (32) is insufficiently detailed.
  • The LoRA rank is fixed at 32; the impact of rank selection on performance is not thoroughly analyzed.
  • Scalability of the 8×A100 training setup to larger models is uncertain.

| Method | Mechanism | Memory Efficiency | Reasoning Accuracy | Applicable Scenario |
|---|---|---|---|---|
| Full-cache LoRA | Standard RL + LoRA | Low | Baseline | Small models / short sequences |
| Sliding-window LoRA | Truncated cache | High | Below baseline | Constrained environments |
| TTT (test-time training) | Gradient updates at inference | Medium | Medium | Online adaptation |
| Ours | Evicted tokens encoded into weights | High | Exceeds baseline | Long-range reasoning RL |

The intuition of "learning from what is discarded" bears conceptual resemblance to knowledge distillation — information from evicted tokens is not lost but compressed and retained. The approach has significant practical implications for inference-time compute optimization, particularly for deploying long-chain reasoning models on edge devices, and is compatible with other efficient inference methods such as speculative decoding and early stopping.

Rating

  • Novelty: ⭐⭐⭐⭐ — Converting cache eviction into online learning is an elegant idea.
  • Experimental Thoroughness: ⭐⭐⭐⭐ — Three models and six benchmarks with thorough comparisons.
  • Writing Quality: ⭐⭐⭐⭐ — Problem formulation is clear and method description is complete.
  • Value: ⭐⭐⭐⭐ — Substantially reduces RL training costs for LRMs with high practical utility.