How Far Can Unsupervised RLVR Scale LLM Training?¶
Conference: ICLR 2026 | arXiv: 2603.08660 | Code: PRIME-RL/TTRL | Area: Reinforcement Learning | Keywords: unsupervised RLVR, model collapse, intrinsic rewards, sharpening mechanism, test-time training
TL;DR¶
This paper presents a comprehensive analysis of Unsupervised Reinforcement Learning from Verifiable Rewards (URLVR). It shows that all intrinsic reward methods fundamentally act as a "sharpening" mechanism over the model's initial distribution, which leads to an inevitable rise-then-fall collapse pattern. It proposes the Model Collapse Step as an indicator of the strength of a model's prior and identifies external reward methods as the key direction for overcoming scalability bottlenecks.
Background & Motivation¶
RLVR (Reinforcement Learning from Verifiable Rewards) has been a core driver of recent breakthroughs in LLM reasoning capabilities (e.g., DeepSeek-R1, Gemini 2.5, Qwen3). However, supervised RLVR relies on high-quality annotated datasets, and as model capabilities approach or surpass human-level performance, obtaining reliable ground truth becomes increasingly difficult and costly — this constitutes the supervision bottleneck.
Unsupervised RLVR (URLVR) emerges as a response, aiming to provide reward signals without ground truth labels. Analogous to how pre-training scaling laws convert large amounts of unlabeled data into intelligence, URLVR seeks to extend this paradigm to the post-training stage.
However, existing URLVR methods (e.g., TTRL, RLIF, EM-RL) — while reporting initial performance gains — also exhibit reward hacking and model collapse. The lack of systematic comparison across different methods and settings raises a fundamental question: can intrinsic rewards truly scale LLM training?
Method¶
Overall Architecture¶
This paper establishes a unified analytical framework for URLVR that combines taxonomy, theory, and experiment:
- Taxonomy: Categorizes URLVR methods into Intrinsic Rewards and External Rewards
- Theoretical Analysis: Derives the "Sharpening Mechanism" underlying intrinsic rewards
- Systematic Experiments: Validates the rise-then-fall pattern and proposes the Model Collapse Step metric
Key Designs¶
- Taxonomy of Intrinsic Reward Methods:
  - Certainty-Based: Rewards derived from the model's own confidence (logits), including Self-Certainty (KL divergence from uniform distribution), Token-Level Entropy (negative entropy), Trajectory-Level Entropy (sequence log-probability), and Probability (product of probabilities)
  - Ensemble-Based: Rewards based on consistency across multiple rollouts, e.g., Majority Voting in TTRL
- Unified Reward Framework: All intrinsic rewards can be unified as \(r_{uni}(x,y) = \psi(\frac{\sigma}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \mathbb{H}(q_i, \pi_\theta^i))\), where \(\mathbb{H}\) is the cross-entropy between an anchor distribution and the policy, \(\mathcal{I}\) denotes the aggregation granularity, \(q\) is the anchor distribution, \(\sigma \in \{+1,-1\}\) is a sign factor, and \(\psi\) is a monotonic transformation. Different methods are simply distinct instantiations of the four components \(\mathcal{I}\), \(q\), \(\sigma\), and \(\psi\).
- Theoretical Proof of the Sharpening Mechanism (using TTRL as an example):
  - The optimal policy under the KL-regularized RL objective is \(\pi_\theta^*(y|x) \propto \pi_{ref}(y|x) \exp(\frac{1}{\beta} r(x,y))\)
  - Under binary majority voting rewards, answers that match the majority receive reward 1, so their probability mass is amplified by a factor of \(e^{1/\beta}\) relative to all other answers
  - Theorem 1: Under majority stability and effective learning assumptions, \(p_{maj}^{(k)}\) converges to 1 at a geometric rate \(\rho = e^{-1/\beta}\), and the policy converges to a deterministic distribution concentrated on the initial majority answer
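The geometric rate in Theorem 1 can be checked numerically. The sketch below is an idealization, not the paper's training loop: it assumes each update applies the closed-form optimal policy exactly, with the previous policy as the reference. Under that assumption the majority mass follows \(p_{k+1} = p_k e^{1/\beta} / (p_k e^{1/\beta} + 1 - p_k)\), so the odds against the majority answer, \((1-p_k)/p_k\), shrink by \(\rho = e^{-1/\beta}\) at every step. The function names are illustrative.

```python
import math


def sharpen_step(p_maj: float, beta: float) -> float:
    """One idealized KL-regularized update with a binary majority-vote reward.

    Answers matching the majority get reward 1 and are reweighted by
    exp(1/beta); all other answers get reward 0 and keep weight 1.
    """
    boost = math.exp(1.0 / beta)
    return p_maj * boost / (p_maj * boost + (1.0 - p_maj))


def majority_mass_trajectory(p0: float, beta: float, steps: int) -> list[float]:
    """Iterate the update, reusing the previous policy as the next reference
    (an assumption made for illustration)."""
    traj = [p0]
    for _ in range(steps):
        traj.append(sharpen_step(traj[-1], beta))
    return traj


if __name__ == "__main__":
    beta = 1.0
    rho = math.exp(-1.0 / beta)
    for k, p in enumerate(majority_mass_trajectory(p0=0.4, beta=beta, steps=8)):
        odds_against = (1.0 - p) / p
        print(f"k={k}  p_maj={p:.4f}  odds_against={odds_against:.5f}")
    print(f"per-step contraction rho = {rho:.4f}")
```

Note that the starting mass only needs to belong to the most frequent answer, not to a correct one: the recurrence concentrates on whatever the initial majority happens to be.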
Loss & Training¶
- Training uses the veRL framework with the GRPO algorithm
- Default configuration: temperature 1.0, batch size 64, 8 rollouts, no KL regularization
- Base model: Qwen3-1.7B-Base; training set: DAPO-17k
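To make the interaction between majority voting and GRPO concrete, here is a minimal sketch of how one group of rollouts could be scored. It is not the veRL or TTRL implementation: the function names are hypothetical, answers are assumed to be already extracted as strings, and the use of the population standard deviation with a small `eps` is an assumption (implementations differ on this detail).

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Treat the most frequent answer in the group as a pseudo-label and give
    reward 1.0 to rollouts that match it, 0.0 otherwise.

    Ties are broken in favor of the answer seen first (Counter ordering).
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


def group_normalized_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Eight rollouts for one prompt, matching the default group size above.
    answers = ["42", "42", "17", "42", "9", "42", "17", "42"]
    label, rewards = majority_vote_rewards(answers)
    print("pseudo-label:", label)
    print("rewards:", rewards)
    print("advantages:", [round(a, 3) for a in group_normalized_advantages(rewards)])
```

In this sketch, a group whose rollouts all agree produces zero advantages, so it contributes no gradient; only groups with disagreement push probability toward the majority answer.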
Key Experimental Results¶
Main Results: Rise-Then-Fall Pattern¶
Across systematic experiments spanning 5 intrinsic reward methods × 4 hyperparameter configurations:
| Method | Collapse Mode | Characteristics |
|---|---|---|
| Self-Certainty | Gradual degradation | Most stable; slowest degradation; Label Accuracy remains relatively high |
| Majority Voting | Gradual degradation | Operates at the answer level, avoiding token-level artifacts |
| Probability | Length collapse | Reward favors short sequences; model output length drops sharply |
| Token-Level Entropy | Repetition collapse | Minimizes entropy by repeating high-probability tokens |
| Trajectory-Level Entropy | Repetition collapse | Same as above; repetitive text fills generated sequences |
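The collapse modes in the table follow from how the certainty-based rewards are computed. The sketch below implements the four rewards as described in the taxonomy, over toy per-token distributions. The exact normalization and the direction of the KL term (taken here as KL(U || p)) are assumptions, and all names are illustrative. One plausible reading of the unified form is that Self-Certainty uses a uniform anchor and Token-Level Entropy uses the policy itself as anchor, since KL(U || p) equals the cross-entropy H(U, p) minus log V and entropy equals H(p, p).

```python
import math

Dist = list[float]  # one next-token distribution over a toy vocabulary


def kl_uniform_to(p: Dist) -> float:
    """KL(U || p), with U uniform over the vocabulary (requires every p_i > 0)."""
    u = 1.0 / len(p)
    return sum(u * math.log(u / pi) for pi in p)


def self_certainty(dists: list[Dist]) -> float:
    """Mean per-token KL divergence from the uniform distribution."""
    return sum(kl_uniform_to(p) for p in dists) / len(dists)


def neg_token_entropy(dists: list[Dist]) -> float:
    """Negative mean per-token entropy (closer to 0 means more confident)."""
    entropies = [-sum(pi * math.log(pi) for pi in p if pi > 0.0) for p in dists]
    return -sum(entropies) / len(entropies)


def sequence_log_prob(dists: list[Dist], tokens: list[int]) -> float:
    """Sum of log-probabilities of the sampled tokens (not length-normalized)."""
    return sum(math.log(p[t]) for p, t in zip(dists, tokens))


def probability(dists: list[Dist], tokens: list[int]) -> float:
    """Product of the sampled tokens' probabilities."""
    return math.exp(sequence_log_prob(dists, tokens))


if __name__ == "__main__":
    confident = [0.9, 0.05, 0.05]
    short, long_ = [confident] * 3, [confident] * 12
    # Same per-token confidence, different lengths: the product shrinks with length.
    print("probability, 3 tokens :", probability(short, [0] * 3))
    print("probability, 12 tokens:", probability(long_, [0] * 12))
    # A near-deterministic distribution scores higher on entropy-style rewards.
    peaked = [[0.998, 0.001, 0.001]] * 4
    print("neg entropy, confident:", neg_token_entropy(short))
    print("neg entropy, peaked   :", neg_token_entropy(peaked))
```

Because the Probability reward multiplies per-token probabilities, every extra token can only lower it, which is consistent with the length collapse in the table. The entropy-style rewards are maximized by near-deterministic next-token distributions, which a model can reach by repeating high-probability tokens.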
Ablation Study¶
| Configuration | Key Finding |
|---|---|
| Hyperparameter tuning | All settings eventually collapse; differences are in when, not whether collapse occurs |
| Single-question training | Only 3 of 25 questions (12%) flipped correctness; the rest merely sharpened existing preferences |
| Dataset size (32→16384) | ≤128 samples prevent collapse; ≥512 samples lead to inevitable collapse |
| Test-time vs. train-time | Test-time training on small datasets is safe and effective; train-time on large datasets inevitably collapses |
| 32 samples with all-wrong initial majority votes | Still improves OOD performance (AIME24/AMC23) |
Model Collapse Step¶
| Model | Collapse Step | GT Gain | Correlation |
|---|---|---|---|
| OLMo series (weak prior) | ~14–22 steps | Low | Strong positive |
| LLaMA series (moderate) | ~19–128 steps | Moderate | Strong positive |
| Qwen series (strong prior) | ~172–195 steps | High | Strong positive |
Computational cost comparison: Measuring the Model Collapse Step takes only about 1/5.6 of the computation needed for GT Gain and requires no ground truth labels.
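This summary does not state the exact rule used to read off the collapse step, so the following is only a hypothetical detector: it returns the first step at which a monitored training curve stays below a threshold for a few consecutive steps. The choice of monitored quantity, the threshold, and the `patience` parameter are all assumptions, and the curve in the example is made up for illustration.

```python
from typing import Optional, Sequence


def collapse_step(curve: Sequence[float], threshold: float, patience: int = 1) -> Optional[int]:
    """Return the first step index at which `curve` stays below `threshold`
    for `patience` consecutive steps, or None if that never happens.

    Hypothetical detection rule, not the paper's definition.
    """
    run = 0
    for step, value in enumerate(curve):
        run = run + 1 if value < threshold else 0
        if run == patience:
            return step - patience + 1
    return None


if __name__ == "__main__":
    # Synthetic rise-then-fall curve (illustrative values only).
    curve = [0.50, 0.60, 0.70, 0.65, 0.30, 0.05, 0.01, 0.00]
    print(collapse_step(curve, threshold=0.1, patience=2))
```

Requiring several consecutive low steps keeps a single noisy dip from being counted as collapse.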
External Rewards: Self-Verification Experiments¶
| Method | Verification Accuracy (Qwen3-1.7B) | Characteristics |
|---|---|---|
| Trajectory-Level Entropy | ~40% then collapse | Inherent scalability limitations |
| Self-Verification | ~65% and continuing to rise | Reward Accuracy initially drops then recovers |
| Oracle Supervision | ~70% | Upper bound |
Key Findings¶
- Sharpening nature of all intrinsic reward methods: Regardless of design differences, all methods sharpen the model's initial distribution, concentrating probability on the answers it already prefers
- Rise-then-fall is intrinsic, not an engineering issue: Even with optimal hyperparameter combinations, collapse occurs after ~1,000 steps (~4 epochs)
- Unique value of small datasets: ≤128 samples avoid collapse through local overfitting rather than global policy shift (KL divergence of only 0.057, vs. more than 2× for large datasets)
- Counter-intuitive OOD generalization: Even when the initial majority vote is wrong on every training question, training can still improve test set performance
- Model prior determines everything: Qwen > LLaMA; SFT > Base; smaller models are paradoxically more stable
- Instruction alignment is critical for Self-Verification: Instruction-aligned models start at 60%+ and are robust to both prompt types; base models are only effective with one prompt type
Highlights & Insights¶
- Unified theoretical framework: Unifies seemingly diverse intrinsic reward methods as different instantiations of cross-entropy manipulation, revealing their essential equivalence
- Sharpening as the core mechanism: A theoretical proof reframes URLVR as amplifying the model's existing preferences rather than acquiring new knowledge
- Model Collapse Step: A simple, low-cost, GT-free metric for predicting RL trainability, with direct practical value for model selection
- Bridging theory and practice: A compelling correspondence between Theorem 1 and the empirically observed rise-then-fall pattern
Limitations & Future Work¶
- Experiments are primarily conducted on mathematical reasoning tasks; generalizability to other domains (code, open-ended dialogue, etc.) remains to be verified
- Self-Verification experiments are validated only on the simple Countdown task; more complex scenarios require further exploration
- Systematic investigation of external reward methods is relatively limited, constituting only "preliminary evidence"
- Theoretical assumptions (majority stability, effective learning) are not always satisfied in practice
- Effective combination of intrinsic and external rewards remains unexplored
Related Work & Insights¶
This paper spans multiple research directions, including RLVR (DeepSeek-R1, Qwen3, etc.), unsupervised/self-supervised learning, and test-time training. Specific methods such as TTRL, RLIF, and EM-RL are analyzed within the unified sharpening framework. The exploration of Self-Verification suggests a way past the confidence–correctness ceiling of intrinsic rewards by moving toward external rewards. Directions that exploit the generation–verification asymmetry (e.g., formal verification, code execution) look especially promising.
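The generation–verification asymmetry can be illustrated with the Countdown task mentioned under the limitations: finding an expression that hits the target requires search, while checking a candidate is a few lines of code and needs no ground-truth label. The sketch below is a programmatic checker, not the paper's Self-Verification setup (where the model judges its own answers). The task rules assumed here (integers combined with `+ - * /`, each provided number used at most once) and the function name `countdown_reward` are assumptions.

```python
import ast
from collections import Counter
from fractions import Fraction

_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _eval(node: ast.AST, used: list[int]) -> Fraction:
    """Evaluate a restricted arithmetic AST exactly, recording the integers used."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def countdown_reward(expression: str, numbers: list[int], target: int) -> float:
    """Return 1.0 if `expression` uses only the provided numbers (each at most
    once) and evaluates exactly to `target`; otherwise return 0.0."""
    try:
        tree = ast.parse(expression, mode="eval")
        used: list[int] = []
        value = _eval(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) - Counter(numbers):  # a number was reused or not provided
        return 0.0
    return 1.0 if value == target else 0.0


if __name__ == "__main__":
    print(countdown_reward("(25 - 5) * 3 + 4", [3, 4, 5, 25], 64))
    print(countdown_reward("8 * 8", [3, 4, 5, 25], 64))
```

Parsing with `ast` instead of calling `eval` keeps model-generated text from executing arbitrary code, and `Fraction` avoids floating-point error in divisions.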
Rating¶
- Novelty: ⭐⭐⭐⭐⭐ (Unified theoretical framework + Model Collapse Step + systematic experimental analysis)
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ (5 methods × 4 hyperparameter settings × multiple model families × multiple datasets)
- Writing Quality: ⭐⭐⭐⭐⭐ (Clear structure; complete logical chain from theory to experiment to practice)
- Value: ⭐⭐⭐⭐⭐ (Provides a clear roadmap and practical tools for the URLVR field)