
M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

Conference: NeurIPS 2025 | arXiv: 2512.13070 | Code: GitHub | Area: Self-Supervised Learning
Keywords: self-supervised reinforcement learning, policy collapse, momentum anchoring, GRPO, entropy filtering

TL;DR

To address policy collapse and entropy collapse in self-supervised reinforcement learning for LLMs, this paper proposes M-GRPO, a momentum-anchored GRPO framework combined with an IQR-based low-entropy trajectory filtering method, achieving stable training and state-of-the-art performance.

Background & Motivation

Background: Reinforcement learning with verifiable rewards (RLVR) has become a core paradigm for post-training LLMs to enhance reasoning capabilities. However, it relies heavily on large-scale human-annotated data and reward model infrastructure, making it costly and domain-restricted. This has motivated exploration of self-supervised or label-free RL signals, such as self-consistency and self-certainty as reward signals (e.g., SRT, TTRL, Intuitor).

Limitations of Prior Work: Self-supervised RLVR (SS-RLVR) suffers from a critical policy collapse phenomenon during extended training — training rewards first increase and then sharply decline, accompanied by simultaneous degradation in validation accuracy. This is further compounded by entropy collapse, where the model becomes overconfident prematurely.

Key Challenge: Increasing the number of rollouts only delays collapse rather than preventing it. The self-reward mechanism is inherently unstable because the pseudo-labels come from the same rapidly changing policy being optimized, so there is no fixed training target.

Goal: To stabilize the training process of SS-RLVR without relying on ground-truth labels, thereby preventing policy collapse and entropy collapse.

Key Insight: Drawing on the success of momentum encoders in contrastive learning (MoCo), this work introduces the momentum mechanism into policy optimization.

Core Idea: A slowly evolving momentum model provides stable pseudo-label training targets, while IQR-based filtering of low-entropy trajectories preserves policy diversity.

Method

Overall Architecture

M-GRPO consists of two key components:

  1. Momentum-anchored self-supervised RL framework: A dual-model setup (current policy \(\pi_{\theta_q}\) + momentum model \(\pi_{\theta_k}\)) jointly generates rollouts and produces pseudo-labels via majority voting.
  2. IQR adaptive entropy filter: Dynamically prunes low-entropy trajectories to prevent premature policy convergence.

Key Designs

  1. Momentum Model Update Mechanism:

    • The momentum model parameters \(\theta_k\) are not updated via backpropagation; instead, they are an exponential moving average (EMA) of the current policy parameters: \(\theta_k \leftarrow m \cdot \theta_k + (1-m) \cdot \theta_q\)
    • \(m \in [0,1)\) is the momentum coefficient (e.g., 0.99), ensuring the momentum model evolves slowly and provides a stable reference.
    • Inspired by the MoCo series of contrastive-learning methods, this work is the first to transfer the momentum-contrast idea to RL policy optimization.
  2. Joint Rollout and Majority Voting:

    • The current policy generates \(M\) responses \(\{y_i^q\}_{i=1}^M\), and the momentum model generates \(N\) responses \(\{y_j^k\}_{j=1}^N\).
    • These are merged into a pool of size \(G = M + N\).
    • Majority voting selects the pseudo ground truth \(y_v\): the response whose final answer occurs most frequently in the pool.
    • Including the momentum model's rollouts in the voting pool is critical — it reduces the noise inherent in pseudo-labels generated solely by the rapidly changing current policy.
  3. Normalized Advantage Estimation: Binary rewards are computed for the current policy's \(M\) rollouts based on the pseudo ground truth (consistent = 1, inconsistent = 0), and normalized per prompt following the GRPO framework: \(\hat{A}_i = \frac{r(y_v, y_i^q) - \text{mean}(\{r(y_v, y_j^q)\}_{j=1}^M)}{\text{std}(\{r(y_v, y_j^q)\}_{j=1}^M)}\)

  4. IQR Adaptive Entropy Filtering:

    • Trajectory-level entropy is computed for each of the \(G\) trajectories per input.
    • The interquartile range (IQR) method is used to detect low-entropy outliers: \(T_{IQR} = Q_1 - k \cdot (Q_3 - Q_1)\), where \(Q_1\) and \(Q_3\) are the first and third quartiles of the per-trajectory entropies and \(k = 0.75\).
    • Trajectories with entropy below the threshold are pruned.
    • Compared to a static threshold (e.g., removing the bottom 10% of trajectories), the IQR method adapts to the shifting entropy distribution throughout training; a minimal code sketch of the full M-GRPO step follows this list.
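
The whole pipeline above can be condensed into a short, self-contained sketch. Everything here is illustrative: the data structures (`Rollout`, `ema_update`, `iqr_threshold`, `mgrpo_advantages`) are hypothetical names, not the authors' code; only the logic follows the paper's description (EMA anchoring, majority voting over the joint pool, per-prompt advantage normalization, and the \(Q_1 - 0.75\,(Q_3 - Q_1)\) entropy cutoff).

```python
# Minimal, illustrative sketch of one M-GRPO update step for a single prompt.
# All names are hypothetical; only the logic mirrors the paper's description.
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Rollout:
    answer: str        # final answer extracted from the generated response
    entropy: float     # trajectory-level (mean token) entropy
    from_policy: bool  # True if sampled from pi_theta_q, False if from pi_theta_k


def ema_update(theta_k: Dict[str, float], theta_q: Dict[str, float], m: float = 0.99) -> None:
    """theta_k <- m * theta_k + (1 - m) * theta_q; no gradients flow into theta_k."""
    for name, value_q in theta_q.items():
        theta_k[name] = m * theta_k[name] + (1.0 - m) * value_q


def iqr_threshold(entropies: List[float], k: float = 0.75) -> float:
    """Low-entropy cutoff T_IQR = Q1 - k * (Q3 - Q1), with crude nearest-rank quartiles."""
    s = sorted(entropies)
    q1 = s[int(0.25 * (len(s) - 1))]
    q3 = s[int(0.75 * (len(s) - 1))]
    return q1 - k * (q3 - q1)


def mgrpo_advantages(rollouts: List[Rollout]) -> List[Tuple[Rollout, float]]:
    """Pseudo-label the joint pool, prune low-entropy trajectories, return (rollout, advantage)."""
    # 1) Majority voting over the joint pool of G = M + N responses -> pseudo label y_v.
    y_v, _ = Counter(r.answer for r in rollouts).most_common(1)[0]
    # 2) IQR entropy filter; applying it only to the trained (current-policy) rollouts
    #    and after voting is an assumption of this sketch, not a detail from the paper.
    cutoff = iqr_threshold([r.entropy for r in rollouts])
    kept = [r for r in rollouts if r.from_policy and r.entropy >= cutoff]
    if not kept:
        return []
    # 3) Binary self-rewards against y_v, then GRPO-style per-prompt normalization.
    rewards = [1.0 if r.answer == y_v else 0.0 for r in kept]
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r, (rw - mu) / sigma) for r, rw in zip(kept, rewards)]
```

In an actual training loop, `ema_update` would run over the full parameter tensors after each optimizer step, and the returned advantages would weight the log-likelihood objective given in the next section.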

Loss & Training

The learning objective is to maximize the advantage-weighted log-likelihood: \(\mathcal{J}(\theta_q) = \mathbb{E}_{x \sim D,\, \{y_i^q\} \sim \pi_{\theta_q}}\left[\sum_{i=1}^M \hat{A}_i \log \pi_{\theta_q}(y_i^q \mid x)\right]\)
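
A hedged PyTorch sketch of this objective for a single prompt is below; the function name, tensor names, and shapes are assumptions, and the KL regularizer mentioned in the training details (coefficient 0.005) is omitted here.

```python
import torch


def mgrpo_objective(token_logps: torch.Tensor, mask: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Advantage-weighted log-likelihood J(theta_q) for one prompt (illustrative shapes).

    token_logps: (M, T) per-token log pi_theta_q(y_t | x, y_<t) for the M kept rollouts
    mask:        (M, T) 1.0 for generated tokens, 0.0 for padding
    advantages:  (M,)   normalized advantages A_hat_i from the M-GRPO step
    """
    seq_logps = (token_logps * mask).sum(dim=-1)  # log pi_theta_q(y_i^q | x) per rollout
    return (advantages * seq_logps).sum()         # maximize J; the optimizer minimizes -J
```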

Training details:

  • Backbone: Qwen3-4B-Base
  • Training data: MATH training set (no ground-truth labels)
  • Batch size: 8 questions, 32 rollouts per question, temperature 1.1
  • Optimizer: AdamW (lr=1e-6, cosine warmup)
  • KL loss coefficient: 0.005
  • Momentum model rollout count: \(N = G/4\)
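
For concreteness, these hyperparameters could be collected into a single config; the dictionary below is a hypothetical layout for illustration, not the authors' configuration file.

```python
# Hypothetical config mirroring the reported hyperparameters (not the authors' file).
mgrpo_config = dict(
    backbone="Qwen3-4B-Base",
    train_data="MATH train split (ground-truth labels discarded)",
    prompts_per_batch=8,
    rollouts_per_prompt=32,     # G; the momentum model contributes N = G / 4 of them
    temperature=1.1,
    optimizer="AdamW",
    learning_rate=1e-6,
    lr_schedule="cosine with warmup",
    kl_coeff=0.005,
    momentum_m=0.99,            # EMA coefficient m for the momentum model (example value)
    iqr_k=0.75,                 # k in the low-entropy cutoff Q1 - k * (Q3 - Q1)
)
```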

Key Experimental Results

Main Results

M-GRPO performance on Qwen3-4B-Base (trained without ground-truth labels):

Method                        MATH500  AIME24  AIME25  GPQA-Dia  GPQA    LiveCode
Base Model                    61.50%   0.83%   5.00%   34.41%    29.91%  9.61%
SRT_Best (manually selected)  79.20%   12.50%  11.67%  38.26%    35.04%  19.69%
SRT_Final (post-collapse)     47.50%   7.50%   8.75%   28.54%    25.89%  16.12%
M-GRPO+IQR_Final              79.75%   14.58%  14.17%  39.65%    35.49%  27.12%

The final checkpoint of M-GRPO outperforms the best manually selected SRT checkpoint without any human intervention.

Ablation Study

Rollout scaling analysis (M-GRPO+IQR):

Config  MATH500  AIME24  AIME25  GPQA-Dia  MBPP
G=8     77.60%   11.25%  10.42%  39.02%    68.60%
G=16    79.75%   14.43%  10.00%  39.65%    70.40%
G=32    79.75%   14.58%  14.17%  39.65%    70.60%
G=256   79.50%   16.67%  14.17%  40.66%    70.40%

Performance improves significantly from G=8 to G=32 and then saturates.

Key Findings

  • SRT eventually collapses under all rollout configurations; increasing rollouts only delays the inevitable.
  • M-GRPO maintains training stability across three model scales: Qwen3-1.7B, 4B, and 8B.
  • Gains of +5.05% on GPQA and +7.43% on LiveCode relative to SRT_Best.
  • The IQR filter effectively maintains higher policy entropy levels, with a slower and more gradual entropy decline.

Highlights & Insights

  • Clear problem diagnosis: The paper systematically reveals the phenomena of policy collapse and entropy collapse in SS-RLVR and their causal relationship.
  • Elegant transfer of the momentum mechanism: The momentum idea from MoCo in visual contrastive learning is transferred to RL policy stabilization with a clear and compelling analogy.
  • Strong practicality: No manual checkpoint selection is required; training is stable throughout, and the final model is the best model.
  • IQR adaptive filtering: More robust than fixed thresholds, dynamically adjusting to the entropy distribution as training progresses.

Limitations & Future Work

  • Training uses only the MATH dataset; generalization of the self-supervised signal to other training domains (e.g., code generation, dialogue) has not been explored.
  • The impact of the momentum coefficient \(m\) on performance is not discussed in detail.
  • Experiments are conducted exclusively on the Qwen3 model family; generalizability to other model families remains unknown.
  • The dual-model architecture introduces approximately 25% additional inference overhead due to the momentum model's \(N\) rollouts.
  • Majority voting assumes that the correct answer is the most frequent one, which may fail on out-of-distribution or difficult samples.
  • The proposed approach offers insights for other RL scenarios requiring self-supervised signals, such as multimodal reasoning and tool-use learning.

Related Work

  • SRT (Sheikh et al.): A self-training RL method serving as the direct baseline; this paper exposes its collapse problem.
  • GRPO/DAPO: Group-relative policy optimization frameworks, on top of which M-GRPO introduces the momentum mechanism.
  • MoCo (He et al., 2020): Momentum contrastive learning, the primary source of inspiration for M-GRPO.

Rating

  • Novelty: ⭐⭐⭐⭐ The combined design of momentum anchoring and IQR filtering is concise and effective, with in-depth problem analysis.
  • Experimental Thoroughness: ⭐⭐⭐⭐ Multi-benchmark, multi-scale validation with comprehensive scaling analysis.
  • Writing Quality: ⭐⭐⭐⭐ Problem formulation is clear and figures are intuitive.
  • Value: ⭐⭐⭐⭐⭐ Addresses a critical practical barrier in SS-RLVR, with significant implications for self-evolving LLM training.