# The Path Not Taken: RLVR Provably Learns Off the Principals
Conference: NeurIPS 2025 Workshop on Efficient Reasoning (Spotlight)
arXiv: 2511.08567
Code: None
Area: LLM Training / Reinforcement Learning
Keywords: RLVR, training dynamics, Three-Gate Theory, parameter-efficient fine-tuning, SFT comparison
## TL;DR
This paper proposes the Three-Gate Theory to explain the apparent sparsity of parameter updates in RLVR. It shows that RLVR learns along off-principal directions in weight space, a fundamentally different optimization regime from SFT, and that directly transplanting SFT-era PEFT methods to RLVR is therefore ill-founded.
## Background & Motivation
### The Success and Paradox of RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has become the standard approach for enhancing the reasoning capabilities of large language models. Yet a puzzling phenomenon is consistently observed in practice: RLVR appears to modify only a small fraction of parameters, while yielding substantial improvements in reasoning performance. This sparsity paradox motivates a deeper investigation into RLVR's actual learning mechanism.
### Limitations of Prior Work
- Prior work treats the observed sparsity of RLVR parameter updates as an intrinsic characteristic.
- There is no systematic, parameter-level characterization of the update dynamics.
- SFT-era parameter-efficient fine-tuning (PEFT) methods have been directly applied to RLVR without theoretical justification.
### Core Insight
The observed sparsity is a surface artifact arising from optimization bias induced by model conditioning. For a fixed pretrained model, updates consistently localize to preferred parameter regions with high cross-experiment reproducibility.
## Method
### Overall Architecture
The paper proposes the Three-Gate Theory to explain the parameter update dynamics of RLVR:
Pretrained Model → [Gate I: KL Anchor] → [Gate II: Model Geometry] → [Gate III: Precision] → Observed "Sparsity"
### Key Designs
#### Gate I: KL Anchor
- RLVR constrains policy updates via KL divergence penalty, keeping parameter changes close to the pretrained model.
- The KL constraint imposes a global upper bound on update magnitude.
- Formalized (schematically) as a gradient step on the penalized objective: \(\theta_{t+1} = \theta_t + \eta \, \nabla_\theta \big( J(\theta) - \lambda \, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \big)\); a toy sketch follows below.
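A minimal PyTorch sketch of this update rule on a toy discrete policy. The names `policy`, `ref_logits`, and the batch tensors are hypothetical stand-ins; the paper works with full LLM RLVR pipelines, not this toy:

```python
import torch

def kl_anchored_step(policy, ref_logits, obs, actions, advantages,
                     lr=1e-5, kl_coef=0.1):
    """One KL-anchored policy-gradient step: ascend J(theta) - lambda * KL."""
    logits = policy(obs)                              # [B, num_actions]
    logp = torch.log_softmax(logits, dim=-1)
    ref_logp = torch.log_softmax(ref_logits, dim=-1)  # frozen reference policy

    # Policy-gradient term: E[A * log pi_theta(a | s)]; actions are int64 [B].
    chosen = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg = (advantages * chosen).mean()

    # Gate I: the KL anchor keeps pi_theta close to the pretrained model.
    kl = (logp.exp() * (logp - ref_logp)).sum(-1).mean()

    loss = -(pg - kl_coef * kl)   # minimize the negative penalized objective
    loss.backward()
    with torch.no_grad():
        for p in policy.parameters():
            if p.grad is not None:
                p -= lr * p.grad
                p.grad = None
    return kl.item()
```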
#### Gate II: Model Geometry
- The weight matrices of pretrained models exhibit a specific spectral structure (principal directions = directions of large singular values).
- Gradient updates from RLVR are steered toward low-curvature, spectrum-preserving subspaces — i.e., off-principal directions.
- This implies that RLVR preserves the core spectral structure encoding pretrained knowledge, performing fine-grained adjustments along "peripheral" directions.
- Formal characterization: given the SVD \(W = U \Sigma V^T\), RLVR updates concentrate along singular directions \((u_i, v_i)\) associated with small singular values \(\sigma_i\); a sketch of how to measure this split follows below.
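A small sketch of one way to measure this, splitting an update's energy between the top-k principal subspace of \(W\) and its off-principal complement. The function name and the choice of `k` are illustrative, not from the paper:

```python
import torch

def update_energy_split(W, dW, k=8):
    """Split the energy of update dW between the top-k principal subspace
    of W (projected on both sides) and its off-principal complement."""
    U, _, Vh = torch.linalg.svd(W, full_matrices=False)
    Uk, Vk = U[:, :k], Vh[:k, :].T               # top-k singular directions

    dW_prin = Uk @ (Uk.T @ dW @ Vk) @ Vk.T       # principal component of dW
    frac_prin = (dW_prin.norm() ** 2 / dW.norm() ** 2).item()
    return frac_prin, 1.0 - frac_prin

W = torch.randn(256, 256)
dW = 1e-3 * torch.randn(256, 256)                # stand-in for a real update
print(update_energy_split(W, dW))
```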
#### Gate III: Precision
- Non-preferred regions (principal directions) also receive minor updates, but these are masked by floating-point precision.
- Under low precision (e.g., BF16), minor updates along principal directions are absorbed by quantization noise.
- The result is that observed parameter changes appear highly sparse, whereas precision conceals a globally distributed fine-tuning along off-principal directions.
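A toy illustration of Gate III: near 1.0, BF16 has a resolution of \(2^{-8} \approx 0.0039\), so an update far below that is rounded away entirely:

```python
import torch

# Gate III in miniature: a small update applied in BF16 can vanish.
w = torch.tensor([1.0], dtype=torch.bfloat16)
delta = 1e-4                        # well below BF16 resolution near 1.0

w_updated = (w.float() + delta).to(torch.bfloat16)
print(torch.equal(w, w_updated))    # True: the update was absorbed

# The same update survives in FP32.
w32 = torch.tensor([1.0]) + delta
print(w32.item())                   # ~1.0001
```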
### Parameter-Level Verification
The paper provides the first parameter-level characterization of RLVR learning dynamics, validating three key properties:
1. Minimal spectral drift: the singular-value spectrum of weight matrices changes only marginally after RLVR.
2. Reduced principal-subspace rotation: the rotation angle of the leading singular vectors is far smaller than under SFT.
3. Aligned off-principal updates: update directions are highly consistent across datasets and RL recipes.
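Hedged sketches of how the first two metrics might be computed. The paper's exact definitions may differ; the choice of `k` and the use of left singular subspaces here are assumptions:

```python
import torch

def spectral_drift(W0, W1):
    """Relative change of the singular-value spectrum between checkpoints."""
    s0, s1 = torch.linalg.svdvals(W0), torch.linalg.svdvals(W1)
    return ((s1 - s0).norm() / s0.norm()).item()

def principal_rotation_deg(W0, W1, k=8):
    """Largest principal angle (degrees) between the top-k left singular
    subspaces of W0 and W1."""
    U0 = torch.linalg.svd(W0, full_matrices=False).U[:, :k]
    U1 = torch.linalg.svd(W1, full_matrices=False).U[:, :k]
    # Cosines of the principal angles are the singular values of U0^T @ U1.
    cos = torch.linalg.svdvals(U0.T @ U1).clamp(max=1.0)
    return torch.rad2deg(torch.acos(cos.min())).item()
```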
### Loss & Training
- RLVR objective: maximize a verifiable reward \(R(y, y^*)\) under a KL penalty: \(\max_\theta \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\,[R(y, y^*)] - \lambda\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\)
- SFT baseline: directly minimizes cross-entropy loss, with updates concentrating along principal directions.
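For contrast, the SFT baseline minimizes the standard cross-entropy (negative log-likelihood) of the reference answers, with no KL anchor and no sampling from the current policy:

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y^*) \sim \mathcal{D}} \left[ \log \pi_\theta(y^* \mid x) \right]
\]

Because the gradient follows fixed supervised targets, nothing restrains it from rotating the principal subspace, which is consistent with the larger spectral drift reported in the results below.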
## Key Experimental Results
### Main Results
Parameter update patterns of RLVR versus SFT are compared across multiple LLMs:
| Method | Spectral Drift ↓ | Principal Subspace Rotation ↓ | Observed Update Sparsity | Reasoning Accuracy Gain ↑ |
|---|---|---|---|---|
| SFT | 0.42 | 12.3° | Low (diffuse) | +3.2% |
| RLVR (GRPO) | 0.08 | 2.1° | High (concentrated) | +8.7% |
| RLVR (PPO) | 0.11 | 3.4° | High (concentrated) | +7.9% |
| RLVR (REINFORCE) | 0.09 | 2.8° | High (concentrated) | +8.1% |
Cross-dataset update consistency analysis:
| Training Dataset | Cosine Similarity to Reference Update Direction | Reasoning Score |
|---|---|---|
| GSM8K | 0.93 | 82.4 |
| MATH | 0.91 | 76.8 |
| ARC-Challenge | 0.89 | 79.1 |
| Mixed Dataset | 0.95 | 84.2 |
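One plausible way to compute such consistency numbers: flatten each run's update relative to the shared pretrained checkpoint and take a cosine similarity. This is a sketch under that assumption; the paper's exact protocol may differ:

```python
import torch

def update_cosine(model_a, model_b, model_ref):
    """Cosine similarity between two runs' flattened update directions,
    both measured against the same pretrained reference checkpoint."""
    da, db = [], []
    for pa, pb, pr in zip(model_a.parameters(),
                          model_b.parameters(),
                          model_ref.parameters()):
        da.append((pa - pr).flatten())
        db.append((pb - pr).flatten())
    # For LLM-scale models, accumulate per-tensor dot products instead of
    # materializing the full flattened vectors.
    da, db = torch.cat(da), torch.cat(db)
    return torch.nn.functional.cosine_similarity(da, db, dim=0).item()
```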
### Ablation Study
Performance of PEFT methods relative to full fine-tuning, applied under SFT versus RLVR:
| PEFT Method | Δ under SFT | Δ under RLVR | Observation |
|---|---|---|---|
| Full Fine-tuning | Baseline | Baseline | — |
| LoRA (rank=16) | -1.2% | -5.8% | Much larger drop under RLVR |
| LoRA (rank=64) | -0.8% | -3.2% | Larger drop under RLVR |
| Sparse Fine-tuning (top-10%) | -0.5% | -4.1% | Much larger drop under RLVR |
| Sparse Fine-tuning (top-30%) | -0.3% | -2.7% | Larger drop under RLVR |
### Key Findings
- RLVR and SFT exhibit fundamentally different update patterns: SFT updates concentrate along principal directions (large singular value directions), whereas RLVR operates along off-principal directions.
- High cross-experiment consistency: different RL algorithms (PPO, GRPO, REINFORCE) and different datasets produce highly consistent update patterns.
- Failure of PEFT methods: methods such as LoRA are designed around a low-rank assumption that matches SFT's principal-direction updates, but RLVR's off-principal updates are inherently diffuse, so low-rank approximations discard critical information (illustrated in the sketch after this list).
- SFT underperforms RLVR: because SFT directly modifies the spectral structure along principal directions, it may degrade pretrained knowledge.
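The low-rank failure mode is easy to reproduce in miniature: a rank-r truncation retains almost all of a concentrated (SFT-like) update but only a small fraction of a diffuse (RLVR-like) one. The synthetic updates below are illustrative, not measured from real checkpoints:

```python
import torch

def lowrank_energy(dW, r):
    """Fraction of an update's energy kept by its best rank-r approximation."""
    s = torch.linalg.svdvals(dW)
    return (s[:r].pow(2).sum() / s.pow(2).sum()).item()

# Concentrated, SFT-like update: a few dominant directions plus small noise.
u, v = torch.randn(512, 4), torch.randn(512, 4)
dW_sft = u @ v.T + 0.01 * torch.randn(512, 512)

# Diffuse, RLVR-like update: energy spread across the whole spectrum.
dW_rlvr = torch.randn(512, 512)

print(lowrank_energy(dW_sft, 16))    # close to 1.0
print(lowrank_energy(dW_rlvr, 16))   # a small fraction
```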
## Highlights & Insights
- Elegance of the Three-Gate Theory: a concise three-layer mechanism explains the surface sparsity observed in RLVR.
- First parameter-level characterization: provides a white-box understanding of RLVR training dynamics.
- Significant practical implications: directly refutes the popular practice of applying SFT-PEFT methods to RLVR.
- Points toward new directions: calls for the design of geometry-aware, RLVR-native learning algorithms rather than recycling SFT-era heuristics.
- Closed theory–experiment loop: forms a complete chain of argument from mathematical derivation to large-scale empirical validation.
## Limitations & Future Work
- Analysis limited to parameter update patterns: no direct link is established to specific performance differences on downstream tasks.
- Mitigation and improvement proposals are preliminary: while the failure modes of PEFT methods are identified, no complete alternative is proposed.
- Model scale constraints: validation is conducted primarily on medium-scale models; very large-scale models remain uncovered.
- Theoretical rigor: certain components of the Three-Gate Theory rely on empirical observations rather than formal proofs.
- Absence of RLVR-native PEFT design: a direction is identified but not realized.
## Related Work & Insights
- RLVR methods: DeepSeek-R1, GRPO, and related work demonstrate the effectiveness of RLVR; this paper explains why it works.
- Parameter-efficient fine-tuning: LoRA, QLoRA, and similar methods achieve strong results in SFT, but this paper reveals their limitations in the RLVR setting.
- Training dynamics: complements Aghajanyan et al.'s work on intrinsic dimensionality.
- Inspired direction: designing novel PEFT methods that exploit off-principal directional structure (e.g., off-principal LoRA).
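A hypothetical sketch of what an "off-principal LoRA" could look like, projecting the low-rank update away from the top-k principal subspace of the frozen weight. This is speculative and is neither proposed nor evaluated in the paper:

```python
import torch

class OffPrincipalLoRA(torch.nn.Module):
    """Hypothetical LoRA variant (not from the paper): the low-rank update
    is projected off the top-k principal subspace of the frozen weight."""
    def __init__(self, W, rank=16, k=8):
        super().__init__()
        U, _, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("W", W)                 # frozen weight [out, in]
        self.register_buffer("Uk", U[:, :k])         # principal left directions
        self.register_buffer("Vk", Vh[:k, :].T)      # principal right directions
        self.A = torch.nn.Parameter(0.01 * torch.randn(W.shape[0], rank))
        self.B = torch.nn.Parameter(torch.zeros(rank, W.shape[1]))

    def forward(self, x):
        dW = self.A @ self.B
        # Remove the component of dW lying in the principal subspace of W
        # (projected on both sides), keeping the off-principal remainder.
        dW = dW - self.Uk @ (self.Uk.T @ dW @ self.Vk) @ self.Vk.T
        return x @ (self.W + dW).T
```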
## Rating
- Novelty: ⭐⭐⭐⭐⭐ — First parameter-level understanding of RLVR.
- Theoretical Depth: ⭐⭐⭐⭐ — Three-Gate Theory framework is clear; partial theoretical rigor.
- Experimental Thoroughness: ⭐⭐⭐⭐ — Validated across multiple algorithms and datasets.
- Practical Impact: ⭐⭐⭐⭐⭐ — Directly guides RLVR training practice.
- Writing Quality: ⭐⭐⭐⭐⭐ — Fluent narrative and well-organized structure.