M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization¶
Conference: NeurIPS 2025 arXiv: 2512.13070 Code: GitHub Area: Self-Supervised Learning Keywords: self-supervised reinforcement learning, policy collapse, momentum anchoring, GRPO, entropy filtering
TL;DR¶
To address policy collapse and entropy collapse in self-supervised reinforcement learning for LLMs, this paper proposes M-GRPO, a momentum-anchored GRPO framework combined with an IQR-based low-entropy trajectory filtering method, achieving stable training and state-of-the-art performance.
Background & Motivation¶
Background: Reinforcement learning with verifiable rewards (RLVR) has become a core paradigm for post-training LLMs to enhance reasoning capabilities. However, it relies heavily on large-scale human-annotated data and reward model infrastructure, making it costly and domain-restricted. This has motivated exploration of self-supervised or label-free RL signals, such as self-consistency and self-certainty as reward signals (e.g., SRT, TTRL, Intuitor).
Limitations of Prior Work: Self-supervised RLVR (SS-RLVR) suffers from a critical policy collapse phenomenon during extended training — training rewards first increase and then sharply decline, accompanied by simultaneous degradation in validation accuracy. This is further compounded by entropy collapse, where the model becomes overconfident prematurely.
Key Challenge: Increasing the number of rollouts only delays collapse rather than preventing it. Because the self-reward target moves with the policy itself, the mechanism lacks a stable training objective.
Goal: To stabilize the training process of SS-RLVR without relying on ground-truth labels, thereby preventing policy collapse and entropy collapse.
Key Insight: Drawing on the success of momentum encoders in contrastive learning (MoCo), this work introduces the momentum mechanism into policy optimization.
Core Idea: A slowly evolving momentum model provides stable pseudo-label training targets, while IQR-based filtering of low-entropy trajectories preserves policy diversity.
Method¶
Overall Architecture¶
M-GRPO consists of two key components:

1. Momentum-anchored self-supervised RL framework: a dual-model setup (current policy \(\pi_{\theta_q}\) + momentum model \(\pi_{\theta_k}\)) jointly generates rollouts and produces pseudo-labels via majority voting.
2. IQR adaptive entropy filter: dynamically prunes low-entropy trajectories to prevent premature policy convergence.
Key Designs¶
- Momentum Model Update Mechanism:
- The momentum model parameters \(\theta_k\) are not updated via backpropagation; instead, they are the exponential moving average of the current policy parameters: \(\theta_k \leftarrow m \cdot \theta_k + (1-m) \cdot \theta_q\)
- \(m \in [0,1)\) is the momentum coefficient (e.g., 0.99), ensuring the momentum model evolves slowly and provides a stable reference.
- Inspired by the MoCo series of contrastive learning, this work is the first to transfer the momentum-contrastive idea to RL policy optimization.
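As a minimal sketch of the EMA step (assuming parameters are stored as plain NumPy arrays; `momentum_update` is a hypothetical helper name, not from the paper's code):

```python
import numpy as np

def momentum_update(theta_k, theta_q, m=0.99):
    """EMA update of the momentum model's parameters; no gradients
    flow into theta_k, it only tracks the current policy slowly."""
    return {name: m * theta_k[name] + (1.0 - m) * theta_q[name]
            for name in theta_k}

# Toy parameters: the momentum model drifts slowly toward the policy.
theta_q = {"w": np.array([1.0, 2.0])}
theta_k = {"w": np.array([0.0, 0.0])}
theta_k = momentum_update(theta_k, theta_q, m=0.99)
```

With \(m = 0.99\), the momentum model absorbs only 1% of the current policy per step, which is what makes it a slowly evolving, stable reference.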
- Joint Rollout and Majority Voting:
- The current policy generates \(M\) responses \(\{y_i^q\}_{i=1}^M\), and the momentum model generates \(N\) responses \(\{y_j^k\}_{j=1}^N\).
- These are merged into a pool of size \(G = M + N\).
- Majority voting selects the pseudo ground truth \(y_v\): the final answer that appears most frequently across the pool.
- Including the momentum model's rollouts in the voting pool is critical — it reduces the noise inherent in pseudo-labels generated solely by the rapidly changing current policy.
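The pooled voting step can be sketched as follows (assuming each rollout has already been reduced to a final answer string; the helper name is illustrative):

```python
from collections import Counter

def majority_vote(pool_answers):
    """Select the pseudo ground truth y_v: the final answer that
    occurs most often in the merged pool of G = M + N rollouts."""
    return Counter(pool_answers).most_common(1)[0][0]

# M=3 answers from the current policy, N=2 from the momentum model
current_policy_answers = ["42", "41", "42"]
momentum_model_answers = ["42", "40"]
y_v = majority_vote(current_policy_answers + momentum_model_answers)  # "42"
```

The momentum model's votes damp the noise of the fast-moving current policy: even if the current policy momentarily drifts, the pool's majority stays anchored.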
- Normalized Advantage Estimation: Binary rewards are computed for the current policy's \(M\) rollouts against the pseudo ground truth (consistent = 1, inconsistent = 0) and normalized per prompt following the GRPO framework: \(\hat{A}_i = \frac{r(y_v, y_i^q) - \text{mean}(\{r(y_v, y_j^q)\}_{j=1}^M)}{\text{std}(\{r(y_v, y_j^q)\}_{j=1}^M)}\)
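In code, the per-prompt normalization is a one-liner over the binary reward vector (a sketch; the small epsilon guarding against zero variance is an assumption, since the paper's handling of all-identical rewards is not specified here):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize binary rewards (1 = matches pseudo-label, 0 = not)
    across the M current-policy rollouts of one prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1, 0, 1, 1])
# Rollouts matching the pseudo-label get a positive advantage;
# the mismatching one gets a negative advantage.
```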
- IQR Adaptive Entropy Filtering:
- Trajectory-level entropy is computed for each of the \(G\) trajectories per input.
- The interquartile range method is used to detect low-entropy outliers: \(T_{IQR} = Q_1 - k \cdot (Q_3 - Q_1)\), with \(k=0.75\).
- Trajectories with entropy below the threshold are pruned.
- Compared to static thresholds (e.g., removing the bottom 10%), the IQR method adaptively responds to dynamic changes in the entropy distribution throughout training.
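A sketch of the adaptive threshold (using NumPy's default linear-interpolation percentiles; the function name is illustrative):

```python
import numpy as np

def iqr_entropy_filter(entropies, k=0.75):
    """Keep-mask over trajectories: prune those whose entropy falls
    below T_IQR = Q1 - k * (Q3 - Q1), recomputed per batch."""
    e = np.asarray(entropies, dtype=float)
    q1, q3 = np.percentile(e, [25, 75])
    threshold = q1 - k * (q3 - q1)
    return e >= threshold, threshold

# One near-deterministic (collapsed) trajectory among five:
keep, t = iqr_entropy_filter([0.9, 1.0, 1.1, 1.2, 0.05])
# only the 0.05-entropy trajectory falls below the threshold
```

Because \(Q_1\) and \(Q_3\) are recomputed on each batch, the threshold tracks the entropy distribution as it shifts over training, unlike a fixed bottom-10% cut.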
Loss & Training¶
The learning objective is to maximize the advantage-weighted log-likelihood:

$$\mathcal{J}(\theta_q) = \mathbb{E}_{x \sim D,\, \{y_i^q\} \sim \pi_{\theta_q}}\left[\sum_{i=1}^M \hat{A}_i \log \pi_{\theta_q}(y_i^q \mid x)\right]$$
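Given the advantages and sequence log-probabilities, the objective itself reduces to a weighted sum (a sketch, assuming sequence-level log-probs are precomputed; gradient ascent on this quantity would be handled by the training framework):

```python
import numpy as np

def mgrpo_objective(advantages, logprobs):
    """Advantage-weighted log-likelihood over the M current-policy
    rollouts of one prompt (higher is better)."""
    return float(np.sum(np.asarray(advantages) * np.asarray(logprobs)))

j = mgrpo_objective([1.0, -1.0], [-2.0, -3.0])
# The positively-advantaged rollout's likelihood is pushed up,
# the negatively-advantaged one's is pushed down.
```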
Training details:

- Backbone: Qwen3-4B-Base
- Training data: MATH training set (no ground-truth labels)
- Batch size: 8 questions, 32 rollouts per question, temperature 1.1
- Optimizer: AdamW (lr=1e-6, cosine warmup)
- KL loss coefficient: 0.005
- Momentum model rollout count: \(N = G/4\)
Key Experimental Results¶
Main Results¶
M-GRPO performance on Qwen3-4B-Base (trained without ground-truth labels):
| Method | MATH500 | AIME24 | AIME25 | GPQA Diamond | GPQA | LiveCode |
|---|---|---|---|---|---|---|
| Base Model | 61.50% | 0.83% | 5.00% | 34.41% | 29.91% | 9.61% |
| SRT_Best (manually selected) | 79.20% | 12.50% | 11.67% | 38.26% | 35.04% | 19.69% |
| SRT_Final (post-collapse) | 47.50% | 7.50% | 8.75% | 28.54% | 25.89% | 16.12% |
| M-GRPO+IQR_Final | 79.75% | 14.58% | 14.17% | 39.65% | 35.49% | 27.12% |
The final checkpoint of M-GRPO outperforms the best manually selected SRT checkpoint without any human intervention.
Ablation Study¶
Rollout scaling analysis (M-GRPO+IQR):
| Config | MATH500 | AIME24 | AIME25 | GPQA Diamond | MBPP |
|---|---|---|---|---|---|
| G=8 | 77.60% | 11.25% | 10.42% | 39.02% | 68.60% |
| G=16 | 79.75% | 14.43% | 10.00% | 39.65% | 70.40% |
| G=32 | 79.75% | 14.58% | 14.17% | 39.65% | 70.60% |
| G=256 | 79.50% | 16.67% | 14.17% | 40.66% | 70.40% |
Performance improves significantly from G=8 to G=32 and then saturates.
Key Findings¶
- SRT eventually collapses under all rollout configurations; increasing rollouts only delays the inevitable.
- M-GRPO maintains training stability across three model scales: Qwen3-1.7B, 4B, and 8B.
- Gains of +5.05% on GPQA and +7.43% on LiveCode relative to SRT_Best.
- The IQR filter effectively maintains higher policy entropy levels, with a slower and more gradual entropy decline.
Highlights & Insights¶
- Clear problem diagnosis: The paper systematically reveals the phenomena of policy collapse and entropy collapse in SS-RLVR and their causal relationship.
- Elegant transfer of the momentum mechanism: The momentum idea from MoCo in visual contrastive learning is transferred to RL policy stabilization with a clear and compelling analogy.
- Strong practicality: No manual checkpoint selection is required; training is stable throughout, and the final model is the best model.
- IQR adaptive filtering: More robust than fixed thresholds, dynamically adjusting to the entropy distribution as training progresses.
Limitations & Future Work¶
- Training is limited to the MATH dataset; whether the approach transfers to other self-supervised RL scenarios (e.g., code generation, dialogue) has not been explored.
- The impact of the momentum coefficient \(m\) on performance is not discussed in detail.
- Experiments are conducted exclusively on the Qwen3 model family; generalizability to other model families remains unknown.
- The dual-model architecture introduces approximately 25% additional inference overhead due to the momentum model's \(N\) rollouts.
- Majority voting assumes that the correct answer is the most frequent one, which may fail on out-of-distribution or difficult samples.
Related Work & Insights¶
- SRT (Sheikh et al.): A self-training RL method serving as the direct baseline; this paper exposes its collapse problem.
- GRPO/DAPO: Group relative policy optimization framework, upon which M-GRPO introduces the momentum mechanism.
- MoCo (He et al., 2020): Momentum contrastive learning, the primary source of inspiration for M-GRPO.
- The proposed approach offers insights for other RL scenarios requiring self-supervised signals, such as multimodal reasoning and tool-use learning.
Rating¶
- Novelty: ⭐⭐⭐⭐ The combined design of momentum anchoring and IQR filtering is concise and effective, with in-depth problem analysis.
- Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark, multi-scale validation with comprehensive scaling analysis.
- Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and figures are intuitive.
- Value: ⭐⭐⭐⭐⭐ Addresses a critical practical barrier in SS-RLVR, with significant implications for self-evolving LLM training.