Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models
Conference: ICLR 2026
arXiv: 2508.00410
Code: https://github.com/tmlr-group/Co-rewarding
Area: Reinforcement Learning
Keywords: Self-supervised RL, Label-free Reasoning, Training Collapse, GRPO, Contrastive Learning
TL;DR
Co-rewarding proposes a self-supervised RL framework that addresses training collapse in self-rewarding RL through two complementary supervision perspectives: a data-side mechanism (cross-view consistency via contrastive paraphrased questions) and a model-side mechanism (EMA teacher model providing pseudo-labels). Without any human annotations, the framework matches or surpasses RLVR (with ground-truth labels) across multiple mathematical reasoning benchmarks.
Background & Motivation
Background: RLVR (Reinforcement Learning with Verifiable Rewards, e.g., GRPO in DeepSeek-R1) is the dominant approach for enhancing LLM reasoning, but relies on human-annotated ground-truth answers as reward signals.
Limitations of Prior Work:
- Ground-truth annotation is costly and difficult to scale, especially for complex tasks.
- Self-rewarding methods (self-certainty, entropy-based, majority voting) can substitute for GT labels but frequently suffer from training collapse.
- Root cause of collapse: reward signals derived from the model's own single-perspective output form a "self-consistency illusion," enabling reward hacking.
Key Challenge: Self-supervised signals are entangled with the current policy — the model achieves high rewards by minimizing entropy or maximizing consistency, but converges to trivial solutions (repeated strings, consistent yet incorrect answers).
Goal:
- How to achieve stable RL training without GT labels?
- How to break the self-consistency illusion arising from a single perspective?
- Can the performance of GT-label-based RLVR be matched?
Key Insight: Inspired by self-supervised learning (SimCLR, BYOL, DINO), the paper argues that genuine reasoning ability should manifest as invariance across views and over time, rather than as certainty about a single output.
Core Idea: Introduce complementary supervisory perspectives via data-side "paraphrase cross-validation" and model-side "EMA teacher pseudo-labels," increasing the difficulty of reward hacking and thereby preventing training collapse.
Method
Overall Architecture
Built upon GRPO optimization, the core innovation lies in modifying the reward source within advantage estimation: instead of GT labels or the model's own single-view output, pseudo-labels from an alternative perspective are introduced as cross-references. Three instantiations are proposed: Variant I (data-side), Variant II (model-side), and Variant III (combined).
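Assuming the standard GRPO group normalization over \(G\) rollouts, the change of reward source described above can be written with a generic pseudo-label \(\bar{y}\) in place of the ground-truth answer. The symbol \(\bar{y}\) is introduced here only for illustration, and the indicator form of the reward is a reading of the binary reward used in training rather than a formula quoted from the paper:

\[
\hat{A}_i = \frac{r(\bar{y}, y_i) - \operatorname{mean}\big(\{r(\bar{y}, y_j)\}_{j=1}^{G}\big)}{\operatorname{std}\big(\{r(\bar{y}, y_j)\}_{j=1}^{G}\big)}, \qquad r(\bar{y}, y_i) = \mathbb{1}\left[y_i = \bar{y}\right]
\]

Under this reading, \(\bar{y}\) would be the majority-voted answer from the paraphrased question in the data-side variant and the majority-voted answer from the EMA teacher in the model-side variant.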
Key Designs
- Co-rewarding-I (Data-side: Contrastive Consistency):
    - Function: Generates cross pseudo-labels for the original question and its semantically paraphrased version.
    - Mechanism: Given original question \(x\) and its paraphrase \(x'\), \(G\) rollouts are sampled from the policy model for each. Majority voting over rollouts of \(x\) yields pseudo-label \(y_v\), and similarly \(y_v'\) for \(x'\). These labels are then cross-applied: \(y_v'\) evaluates rollouts of \(x\), and \(y_v\) evaluates rollouts of \(x'\). Advantage is computed as \(\hat{A}_i = \frac{r(y_v', y_i) - \text{mean}(...)}{\text{std}(...)}\).
    - Design Motivation: Semantically equivalent paraphrases should yield identical answers (analogy-invariance). Cross-validation makes it difficult for the model to obtain high rewards by producing consistent but incorrect answers on a single input, since the paraphrase outputs "audit" the original answers.
- Co-rewarding-II (Model-side: Self-distillation):
    - Function: Provides pseudo-labels via an EMA-updated teacher model.
    - Mechanism: Maintains an EMA teacher \(\tilde{\pi}_{ref}^{(k)} \leftarrow \alpha^{(k)} \tilde{\pi}_{ref}^{(k-1)} + (1-\alpha^{(k)}) \pi_{\theta_{old}}^{(k)}\), with the EMA coefficient annealed via cosine scheduling from \(\alpha_{start}\) to \(\alpha_{end}\). The teacher generates rollouts and produces pseudo-labels \(\tilde{y}_v\) via majority voting, which are then used to evaluate the policy model's rollouts.
    - Design Motivation: The EMA-updated teacher is "temporally decoupled" from the online policy: its pseudo-labels are not immediately affected by policy updates, breaking the instantaneous feedback loop of self-rewarding. This is analogous to the momentum teacher in BYOL/DINO.
- Co-rewarding-III (Combined):
    - Function: Simultaneously employs paraphrased questions and the EMA teacher for full decoupling.
    - Mechanism: The teacher model generates rollouts on paraphrased questions → produces pseudo-labels → supervises the policy model's rollouts on original questions (and vice versa). Complementary perspectives are introduced along both data and model dimensions.
    - Design Motivation: Data-side and model-side perspectives are complementary: Variant I addresses single-perspective data issues, while Variant II addresses entanglement between supervision and policy.
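The data-side mechanism (Co-rewarding-I) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes final answers have already been extracted from rollouts as strings, uses exact string match as the binary reward, and adds a small `eps` to the standard deviation to avoid division by zero. The function names `majority_vote`, `group_advantages`, and `co_rewarding_one` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote(answers):
    """Return the most frequent final answer among a group of rollouts."""
    return Counter(answers).most_common(1)[0][0]


def group_advantages(answers, pseudo_label, eps=1e-6):
    """Binary reward against a pseudo-label, normalized within the group.

    `eps` is an assumption added for numerical safety when all rewards
    in the group are identical (zero standard deviation).
    """
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def co_rewarding_one(orig_answers, para_answers):
    """Cross-apply majority-voted labels between a question and its paraphrase."""
    y_v = majority_vote(orig_answers)        # label voted on the original question
    y_v_prime = majority_vote(para_answers)  # label voted on the paraphrase
    # Rollouts on the original question are scored by the paraphrase's label,
    # and rollouts on the paraphrase are scored by the original's label.
    adv_orig = group_advantages(orig_answers, y_v_prime)
    adv_para = group_advantages(para_answers, y_v)
    return adv_orig, adv_para
```

For example, calling `co_rewarding_one(["4", "4", "4", "5"], ["5", "5", "4", "5"])` gives a positive advantage to the single original-question rollout that answered "5", because the paraphrase group voted for "5". This illustrates how a consistent answer on one view earns no reward unless the other view agrees.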
Loss & Training
- Based on GRPO: \(\mathcal{J}(\theta) = \text{clipped surrogate objective} - \beta \cdot D_{KL}(\pi_\theta \| \pi_{ref})\)
- Reward function: Binary (correct = 1, incorrect = 0), evaluated using cross pseudo-labels
- EMA teacher update: Cosine annealing schedule — faster updates early (tracking policy improvement), slower updates later (stabilizing supervision)
- Paraphrase generation: Semantically equivalent paraphrases of original math problems are generated using an LLM
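The EMA teacher update with a cosine-annealed coefficient can be sketched as follows. This is an illustration under stated assumptions: the exact cosine form below is one common way to move from \(\alpha_{start}\) to \(\alpha_{end}\) and may differ from the paper's schedule, weights are represented as plain Python lists rather than model tensors, and the names `cosine_alpha` and `ema_update` are hypothetical.

```python
import math


def cosine_alpha(step, total_steps, alpha_start, alpha_end):
    """Anneal the EMA coefficient from alpha_start (step 0) to alpha_end (last step).

    Assumed schedule: half-cosine interpolation between the two endpoints.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return alpha_end - (alpha_end - alpha_start) * cosine


def ema_update(teacher, policy, alpha):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * policy."""
    return {
        name: [alpha * t + (1.0 - alpha) * p
               for t, p in zip(teacher[name], policy[name])]
        for name in teacher
    }


def train_loop_sketch(teacher, policy_snapshots, alpha_start, alpha_end):
    """Apply one EMA update per policy snapshot, annealing alpha along the way."""
    total = len(policy_snapshots) - 1
    for step, policy in enumerate(policy_snapshots):
        alpha = cosine_alpha(step, total, alpha_start, alpha_end)
        teacher = ema_update(teacher, policy, alpha)
    return teacher
```

With \(\alpha_{start} < \alpha_{end}\), a smaller coefficient early on lets the teacher track the policy quickly, while a coefficient close to 1 later makes the teacher change slowly, matching the "faster early, slower later" behavior described above. Concrete endpoint values are not given in this summary, so any numbers passed to these functions are placeholders.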
Key Experimental Results
Main Results (MATH training set, Qwen3-8B-Base)
| Method | MATH500 | GSM8K | AMC | IFEval | MMLU-Pro |
|---|---|---|---|---|---|
| Before RL | 72.4 | 27.8 | 20.9 | 50.9 | 52.9 |
| GT-Reward (RLVR) | 82.6 | 87.3 | 54.2 | 52.8 | 57.1 |
| Self-Certainty | 80.2 | 80.7 | 50.8 | 51.0 | 54.2 |
| Majority-Voting | 79.8 | 89.8 | 49.1 | 51.8 | 56.9 |
| Co-rewarding-I | 81.2 | 93.7 | 51.2 | 55.8 | 60.0 |
| Co-rewarding-II | 80.8 | 92.4 | 53.5 | 60.7 | 57.5 |
| Co-rewarding-III | 81.4 | 91.0 | 54.1 | 53.7 | 59.1 |
Ablation Study
| Configuration | Training Collapse? | Avg. Math Reasoning Gain |
|---|---|---|
| Self-Certainty | Frequent | +3% |
| Entropy | Occasional | +2% |
| Majority-Voting | Occasional | +4% |
| Co-rewarding | None | +7.49% (Llama-3.2-3B) |
Key Findings
- GSM8K 94.01%: Co-rewarding achieves 94.01% Pass@1 on GSM8K with Qwen3-8B-Base, surpassing GT-label RLVR (87.26%); in this setting the label-free approach outperforms the labeled one.
- Training Stability: All self-rewarding baselines exhibit collapse during training (sudden spikes in validation loss), while Co-rewarding training curves remain consistently stable.
- Average Gain of +3.31%: Across multiple mathematical reasoning benchmarks, Co-rewarding outperforms the best self-rewarding baseline by an average of 3.31%, reaching +7.49% on Llama-3.2-3B.
- Cross-task Transfer: Trained solely on MATH, the method yields significant improvements on Code (LiveCodeBench) and Instruction Following (IFEval).
- Complementary Variant Strengths: Variant I is strongest on GSM8K, Variant II on IFEval, and Variant III is the most balanced overall.
Highlights & Insights
- Elegant Transfer of Self-supervised Learning Philosophy: The conceptual transfer of SimCLR's "dual-view consistency" and BYOL/DINO's "momentum teacher" to LLM RL training is clear and precise. This suggests a broader methodological principle: successful paradigms from self-supervised learning can be systematically transferred to RL.
- Label-free Outperforming Labeled Training: The phenomenon of 94.01% vs. 87.26% (GT-Reward) on GSM8K warrants attention. A plausible explanation is that self-supervised signals encourage more diverse exploration, while binary GT rewards may over-constrain the policy.
- EMA Teacher + Paraphrase Cross-validation is Practically Scalable: Computational overhead is manageable (EMA requires no additional optimizer), paraphrases can be generated offline, and the overall approach scales more readily to unlabeled data than RLVR.
Limitations & Future Work
- Paraphrase quality affects the effectiveness of Co-rewarding-I, requiring high-quality paraphrase models.
- EMA teacher hyperparameters (\(\alpha_{start}, \alpha_{end}\)) require tuning.
- Training is validated only on mathematical reasoning data; beyond the reported transfer evaluations, effectiveness when training on NLP reasoning, code generation, and other domains remains to be explored.
- Co-rewarding-III requires maintaining both the EMA teacher and paraphrase data, incurring higher memory overhead.
- Theoretical analysis is insufficient — formal guarantees explaining why cross-validation prevents collapse are lacking.
Related Work & Insights
- vs. Self-Certainty (Zhao et al.): Single-perspective certainty signals are prone to collapse; Co-rewarding introduces multi-perspective signals for stability.
- vs. RLVR (GT-Reward): RLVR's reliance on human annotation limits scalability; Co-rewarding is label-free yet matches or exceeds RLVR in multiple settings.
- vs. Majority-Voting (Shafayat et al.): Both use majority voting, but the latter applies it only within a single question — still a single-perspective approach; Co-rewarding introduces genuinely complementary perspectives via paraphrases or the teacher model.
Rating
- Novelty: ⭐⭐⭐⭐ The conceptual transfer from self-supervised learning to RL is elegant, though the individual techniques (paraphrasing + majority voting + EMA) are not entirely novel.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ Multiple model families (Qwen3/Llama), multiple benchmarks, training stability visualizations, and comprehensive ablations.
- Writing Quality: ⭐⭐⭐⭐ Concepts are clearly articulated and the progressive relationship among the three variants is well-structured, though the notation is dense.
- Value: ⭐⭐⭐⭐⭐ Training stability in label-free RL is a practical pain point; this paper presents a viable solution.