# Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Conference: NeurIPS 2025 · arXiv: 2505.11711 · Code: GitHub · Area: LLM Alignment · Keywords: reinforcement-learning, sparse subnetwork, parameter update sparsity, LLM finetuning, lottery ticket hypothesis

## TL;DR
RL fine-tuning of LLMs updates only 5%–30% of parameters in practice (sparse subnetworks), and these subnetworks exhibit high consistency across different random seeds, datasets, and algorithms. Fine-tuning only the identified subnetwork can reproduce both the performance and the parameter values of full fine-tuning.
## Background & Motivation
- Reinforcement learning (RL) is a critical stage in LLM post-training, used for improving reasoning capabilities and aligning models with human values.
- The prevailing assumption holds that RL requires substantial modification of model parameters to achieve target behaviors, motivating the widespread adoption of full-parameter fine-tuning.
- However, does RL actually update all parameters? This paper provides a negative answer.
- This phenomenon emerges naturally without any explicit sparsity regularization, architectural constraints, or parameter-efficient training methods.
- Prior work on the Lottery Ticket Hypothesis (LTH) focuses on performance recovery; this paper further demonstrates that nearly identical parameter values can also be recovered.
## Method

### Overall Architecture
Rather than proposing a new algorithm, this paper systematically analyzes the parameter update sparsity phenomenon in RL fine-tuning. The core pipeline, sketched in code after this list, is:
- Measuring update sparsity: Comparing parameters before (\(\theta_{\text{init}}\)) and after (\(\theta_{\text{full}}\)) RL fine-tuning, defining \(\text{sparsity}(\theta_{\text{init}}, \theta_{\text{full}}) = 1 - \|\theta_{\text{full}} - \theta_{\text{init}}\|_0 / n\), where \(n\) is the total number of parameters
- Extracting subnetworks: Defining a binary mask \(m \in \{0,1\}^{|\theta|}\), where \(m_i = 1\) when \((\theta_{\text{init}} - \theta_{\text{full}})_i \neq 0\)
- Subnetwork fine-tuning validation: Restricting gradient updates via \(m \odot \nabla_\theta \mathcal{L}(\theta)\), training only the subnetwork
- Cross-condition consistency analysis: Comparing subnetwork overlap across different seeds, datasets, and algorithms
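A minimal PyTorch sketch of this measurement-and-masking pipeline (function names are illustrative, not from the paper's released code; `TOLERANCE` follows the \(10^{-5}\) threshold described under Key Designs):

```python
import torch

TOLERANCE = 1e-5  # differences at or below this threshold count as "unchanged"

def update_sparsity(theta_0: dict, theta_1: dict) -> float:
    """sparsity(theta_0, theta_1) = 1 - ||theta_1 - theta_0||_0 / n."""
    changed, total = 0, 0
    for name, p0 in theta_0.items():
        diff = (theta_1[name].float() - p0.float()).abs()
        changed += (diff > TOLERANCE).sum().item()
        total += p0.numel()
    return 1.0 - changed / total

def extract_masks(theta_init: dict, theta_full: dict) -> dict:
    """Binary masks m with m_i = 1 wherever RL moved the parameter."""
    return {name: ((theta_full[name].float() - p.float()).abs() > TOLERANCE).float()
            for name, p in theta_init.items()}

def apply_gradient_masks(model: torch.nn.Module, masks: dict) -> None:
    """Train only the subnetwork: replace each gradient with m ⊙ ∇L."""
    for name, param in model.named_parameters():
        mask = masks[name].to(param.device)
        param.register_hook(lambda g, m=mask: g * m)
```

One caveat for the masked-training step: optimizers with weight decay can still move masked parameters, so a stricter variant also copies the initial values back into the masked entries after each optimizer step.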
### Key Designs
Definition and measurement of parameter update sparsity:
- bfloat16 precision is used; absolute differences \(\leq 10^{-5}\) are treated as equal (consistent with PyTorch default tolerance).
- Seven RL algorithms are covered: PPO, GRPO, DPO, ORPO, KTO, SimPO, and PRIME.
- Ten LLMs from different model families are evaluated.
Subnetwork overlap metric:

\[ \text{overlap}(\mathcal{I}_1, \mathcal{I}_2) = \frac{|\mathcal{I}_1 \cap \mathcal{I}_2|}{\max(|\mathcal{I}_1|, |\mathcal{I}_2|)}, \]

where \(\mathcal{I}_1, \mathcal{I}_2\) are the index sets of updated parameters from two separate training runs.
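A short sketch of this computation (the \(\max\) normalization follows the reconstruction above, so treat the baseline derivation as an assumption):

```python
import torch

def subnetwork_overlap(mask_1: torch.Tensor, mask_2: torch.Tensor) -> float:
    """|I1 ∩ I2| / max(|I1|, |I2|) for binary update masks."""
    intersection = (mask_1.bool() & mask_2.bool()).sum().item()
    return intersection / max(mask_1.sum().item(), mask_2.sum().item())

def random_baseline(d1: float, d2: float) -> float:
    """Expected overlap of two independent random masks with update
    densities d1, d2: E|I1 ∩ I2| = n*d1*d2 over max(n*d1, n*d2),
    which simplifies to min(d1, d2)."""
    return min(d1, d2)
```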
Core Conjecture (Conjecture 1): Under the same data and hyperparameters, subnetwork fine-tuning yields \(\theta_{\text{sub}} \approx \theta_{\text{full}}\)—that is, not only performance but also parameter values are nearly identical.
### Analysis of Update Sparsity Origins
The paper systematically investigates multiple potential contributing factors:
| Factor | Impact |
|---|---|
| Gradient clipping | Limited (sparsity similar with/without clipping: 69.8% vs. 68.8%) |
| KL regularization | Limited (SimPO without KL remains sparse) |
| SFT pre-training | Not necessary (DeepSeek-R1-Zero without SFT still 86% sparse) |
| Training steps | Large impact early on; converges over time |
| In-distribution data training | Primary driver |
Core finding: Training on in-distribution data is the primary driver of sparsity. When the policy already assigns high probability to the training sequences, the gradients computed on those sequences are small, so almost no parameters need to change.
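This intuition is easy to verify for a cross-entropy-style objective, where the gradient with respect to the logits is \(\text{softmax}(z) - \text{onehot}(y)\) and therefore vanishes as the policy's probability of the target token approaches 1. A toy illustration (not from the paper):

```python
import torch
import torch.nn.functional as F

def logit_grad_norm(logits: torch.Tensor, target: int) -> float:
    """Norm of d(cross-entropy)/d(logits) for one next-token prediction."""
    logits = logits.clone().requires_grad_(True)
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([target]))
    loss.backward()
    return logits.grad.norm().item()

confident = torch.tensor([10.0, 0.0, 0.0])  # policy already puts ~1.0 on token 0
uncertain = torch.tensor([0.1, 0.0, 0.0])   # near-uniform policy

print(logit_grad_norm(confident, target=0))  # ~1e-4: almost no update needed
print(logit_grad_norm(uncertain, target=0))  # ~0.8: large update
```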
## Key Experimental Results

### Main Results: RL Update Sparsity
| Algorithm | Model | Update Sparsity |
|---|---|---|
| DPO | Tulu-3-8B | 81.4% |
| DPO | Tulu-3-70B | 95.2% |
| GRPO | DeepSeek-Math-7B | 68.5% |
| GRPO | DeepSeek-R1-Zero | 86.0% |
| KTO | Eurus-7B | 96.0% |
| PPO | Math-Shepherd-Mistral-7B | 80.8% |
| SimPO | Llama-3-8B-SimPO | 86.5% |
| PRIME | Eurus-2-7B | 77.0% |
68.5%–96.0% of parameters remain unchanged across all RL fine-tuned models. In contrast, SFT exhibits only 6%–15% sparsity.
### Subnetwork Fine-Tuning Validation
| Task | \(\theta_{\text{full}}\) | \(\theta_{\text{sub}}\) | Gain |
|---|---|---|---|
| AGIEval LSAT-AR (DPO) | 21.3 | 24.8 | +3.5 |
| AGIEval LSAT-LR (DPO) | 53.1 | 54.7 | +1.6 |
| MMLU Pro Math (DPO) | 50.8 | 51.6 | +0.8 |
| MATH500 Overall (PRIME) | 69.8 | 72.2 | +2.4 |
| MATH500 Lvl5 (PRIME) | 40.3 | 45.5 | +5.2 |
Subnetwork fine-tuning not only recovers the performance of the full model but outperforms full fine-tuning on all tasks shown. At tolerance \(10^{-4}\), \(\theta_{\text{full}}\) and \(\theta_{\text{sub}}\) agree on 100% of parameter values.
### Ablation Study: Cross-Condition Subnetwork Overlap
| Varying Factor | Random Baseline | RL Subnetwork Overlap |
|---|---|---|
| Different seeds | 36.7% | 60.5% |
| Different data | 14.6% / 36.7% | 26.7% / 67.1% |
| Seeds + data + algorithm | 23.0% / 12.9% | 59.1% / 33.2% |

(Paired values report two separate experimental settings; in every case the observed overlap substantially exceeds the random baseline.)
### Key Findings
- The rank of update matrices is nearly full (99.2%–99.8%), indicating that RL updates are "sparse but full-rank."
- Updates are not concentrated in specific layers—nearly all parameter matrices exhibit similarly sparse updates (except LayerNorm).
- In PRIME, approximately 72% of parameters are never updated, 8% receive gradients that mutually cancel, and 20% constitute the effective subnetwork (see the sketch after this list).
- In-distribution SFT (e.g., rejection sampling) also yields sparse updates (~90% sparsity), whereas out-of-distribution DPO produces only ~7% sparsity.
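A hypothetical sketch of how this three-way decomposition can be measured, by tracking which parameters ever receive a nonzero gradient versus which ones end up changed (`train_step` and `batches` are assumed stand-ins for an RL training loop):

```python
import torch

def classify_parameters(model, theta_init, train_step, batches, tol=1e-5):
    """Split parameters into: never updated, cancelled (touched by gradients
    but back at their initial value), and the effective subnetwork."""
    touched = {n: torch.zeros_like(p, dtype=torch.bool)
               for n, p in model.named_parameters()}
    for batch in batches:
        train_step(model, batch)  # assumed to leave p.grad populated after the step
        for n, p in model.named_parameters():
            if p.grad is not None:
                touched[n] |= p.grad.abs() > 0
    stats = {"never": 0, "cancelled": 0, "effective": 0, "total": 0}
    for n, p in model.named_parameters():
        changed = (p.detach() - theta_init[n]).abs() > tol
        stats["never"] += (~touched[n]).sum().item()
        stats["cancelled"] += (touched[n] & ~changed).sum().item()
        stats["effective"] += changed.sum().item()
        stats["total"] += p.numel()
    return stats
```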
## Highlights & Insights
- Beyond LTH: Not only can subnetwork performance be reproduced, but parameter values are nearly identical—a stronger conclusion than the Lottery Ticket Hypothesis.
- Sparse but full-rank: This stands in sharp contrast to the low-rank assumption underlying LoRA; RL updates select a small fraction of parameters yet span nearly the full column space of the parameter matrices (a quick numerical check follows this list).
- In-distribution training is key: This provides a unified explanation for why both on-policy RL and off-policy RL following SFT produce sparse updates.
- Practical implications: The findings provide a theoretical basis for efficient RL training—training only the subnetwork can substantially reduce computational cost.
- Transferable structure in pretrained models: The high overlap of subnetworks across conditions suggests that pretrained models contain inherent substructures that are naturally amenable to RL fine-tuning.
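Sparsity and rank are independent properties, which the following toy check makes concrete: a random matrix with ~80% of its entries zeroed is still full rank almost surely.

```python
import torch

torch.manual_seed(0)
n = 512
delta = torch.randn(n, n, dtype=torch.float64)
mask = torch.rand(n, n) < 0.2          # keep ~20% of entries, as in RL updates
delta_sparse = delta * mask

sparsity = 1.0 - mask.double().mean().item()
rank = torch.linalg.matrix_rank(delta_sparse).item()
print(f"sparsity ≈ {sparsity:.1%}, rank = {rank} / {n}")  # ~80% sparse, rank ≈ 512
```

LoRA, by contrast, caps the rank of the update matrix at \(r \ll n\) by construction, which is exactly the constraint the paper's rank measurements call into question.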
## Limitations & Future Work
- Due to RL computational costs, only one factor is varied at a time, potentially overlooking interaction effects between factors.
- Some experiments rely on publicly available checkpoints rather than fully controlled training runs.
- The analysis is limited to language models; multimodal and diffusion models remain unexplored.
- Methods for identifying subnetworks early in training are not investigated; how to discover the subnetwork at the onset of training remains an open question.
- A rigorous theoretical analysis of the mathematical nature of update sparsity is lacking.
- Certain counterexamples (e.g., prolonged RL training) exist but are not thoroughly discussed.
## Related Work & Insights
- Challenge to LoRA: RL updates are sparse but full-rank, which differs from LoRA's low-rank assumption, suggesting that LoRA may not be the optimal strategy for RL fine-tuning.
- Implications for efficient training: Early identification of subnetworks during training could substantially reduce the computational overhead of RL.
- Understanding RL vs. SFT: The superior preservation of pretrained capabilities under RL compared to SFT may stem precisely from updating fewer parameters.
- Cross-algorithm subnetwork reuse: A cheaper algorithm (e.g., DPO) could be used to identify subnetworks, which are then applied to more expensive algorithms (e.g., PPO).
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First systematic revelation of parameter update sparsity in RL fine-tuning, with findings that surpass LTH
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Covers 7 algorithms and 10 models with systematic ablation of contributing factors
- Writing Quality: ⭐⭐⭐⭐ Clear structure and rigorous argumentation, though some evidence relies on publicly available checkpoints
- Value: ⭐⭐⭐⭐⭐ Deep insights into the nature of RL fine-tuning with significant theoretical and practical implications