Nudging the Boundaries of LLM Reasoning¶
Conference: ICLR 2026 · arXiv: 2509.25666 · Code: GitHub · Area: LLM Reasoning · Keywords: Reinforcement Learning Reasoning, GRPO Improvement, Self-Generated Hints, Capability Upper Bound, Zone of Proximal Development
TL;DR¶
This paper identifies a fundamental limitation of GRPO: it cannot learn from problems the model completely fails to solve (pass rate = 0%), since these produce zero gradients. The proposed method, NuRL, addresses this by injecting self-generated abstract hints (which reveal neither answers nor solution steps) into hard problems during training, converting them into learnable samples. NuRL consistently outperforms GRPO across 3 models and 6 benchmarks, and genuinely raises the pass@k capability upper bound.
Background & Motivation¶
- Core Limitation: Online RL with GRPO produces zero gradients on problems the model completely fails (all rollouts incorrect), preventing it from learning from them; a numeric sketch follows this list.
- Distribution Sharpening Hypothesis: Growing evidence suggests that RL post-training primarily performs "distribution sharpening"—increasing the probability of already-known solutions—rather than discovering new reasoning capabilities.
- Unchanged pass@k: pass@k at large \(k\) often remains flat after RL training, indicating that the capability upper bound has not been raised.
- Zone of Proximal Development: By analogy to Vygotsky's Zone of Proximal Development, hard problems lie in the "learning zone", which the model cannot enter without guidance.
- Value of Hard Problems: These "unsolvable" problems contain rich learning signals, exposing the model's weaknesses.
- Self-Sufficiency Requirement: A method is needed to push beyond capability boundaries without relying on external, stronger models.
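To make the zero-gradient failure concrete: GRPO's advantage is the group-normalized reward, so a group with uniform rewards (all failures) contributes nothing to the policy gradient. A minimal numeric sketch, with an illustrative `grpo_advantages` helper (not the paper's code):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-normalized advantage: (r - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Mixed group: rewards differ, so advantages are informative.
print(grpo_advantages([1, 0, 0, 1]))  # ~[ 1., -1., -1.,  1.]

# Fully failed group (pass rate = 0%): r - mean(r) is zero everywhere,
# so every advantage is exactly 0 and the problem yields no gradient.
print(grpo_advantages([0, 0, 0, 0]))  # [0. 0. 0. 0.]
```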
Method¶
Overall Architecture¶
NuRL = Offline Hint Collection + Online Rollout Augmentation (two-stage training)
Key Designs¶
- Offline Hint Collection (see the sketch after this list):
    - Input: a (question \(q\), gold answer \(a\)) pair.
    - Step 1: The model generates a CoT explaining why the answer is correct: \(y \sim \pi_\theta(\cdot \mid q, a; p_y)\).
    - Step 2: An abstract, high-level hint (core knowledge cue) is distilled from the CoT: \(h \sim \pi_\theta(\cdot \mid q, a, y; p_h)\).
    - Key Constraint: Hints must remain abstract and high-level, revealing neither the specific answer nor explicit solution steps.
- Online Rollout Augmentation (see the sketch after this list):
    - During GRPO training, \(\mathcal{G}\) rollouts are generated per problem.
    - If all \(\mathcal{G}\) rollouts fail (pass rate = 0%), the self-generated hint is appended to the problem.
    - \(\mathcal{G}-1\) hint-augmented rollouts are regenerated, plus 1 rollout without the hint, so the group reward variance stays nonzero even when every hinted rollout succeeds.
    - Hints are used only during training, never at inference; the aim is for the model to internalize the hinted reasoning patterns.
- Hint Type Exploration:
    - Abstract cues (best) > partial steps > explanations > direct answers (worst).
    - Core finding: the more answer information a hint exposes, the worse the resulting performance, consistent with how humans learn better from guidance than from revealed answers.
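Putting both stages together, here is a minimal sketch of offline hint collection and the online full-failure trigger. It assumes a `model.generate(prompt, n)` interface returning `n` sampled completions and a binary `grade` reward; the templates `P_Y`/`P_H`, the function names, and the group-size default are illustrative assumptions, not the paper's released code.

```python
# Illustrative prompt templates standing in for the paper's p_y and p_h.
P_Y = ("Question: {q}\nGold answer: {a}\n"
       "Explain step by step why this answer is correct.")
P_H = ("Question: {q}\nGold answer: {a}\nExplanation: {y}\n"
       "State the single core idea needed to solve this problem, without "
       "revealing the answer or any explicit solution steps.")

def collect_hint(model, question, gold_answer):
    """Offline stage: distill an abstract hint from a self-generated CoT."""
    cot = model.generate(P_Y.format(q=question, a=gold_answer), n=1)[0]          # y
    return model.generate(P_H.format(q=question, a=gold_answer, y=cot), n=1)[0]  # h

def nudged_rollouts(model, question, hint, grade, G=8):
    """Online stage: inject the hint only when all G rollouts fail."""
    rollouts = model.generate(question, n=G)
    if any(grade(r) for r in rollouts):
        return rollouts                                # learnable as-is
    hinted = model.generate(f"{question}\nHint: {hint}", n=G - 1)
    plain = model.generate(question, n=1)              # one unhinted rollout keeps
    return hinted + plain                              # group reward variance nonzero
```

At inference time only the bare question is used; the hint never appears in the prompt.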
Loss & Training¶
- Stage 1: Standard GRPO training until training reward and validation accuracy converge.
- Stage 2: NuRL training continues from the Stage 1 checkpoint; problems that every Stage 1 rollout already solved are filtered out (data filter sketched below).
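A minimal sketch of the Stage 2 data filter, assuming per-problem pass rates were logged during Stage 1 (the names are illustrative):

```python
def stage2_dataset(problem_ids, stage1_pass_rate, hints):
    """Drop problems already mastered in Stage 1; pair the rest with their
    offline hints so the full-failure trigger can inject them later."""
    return [(pid, hints.get(pid))
            for pid in problem_ids
            if stage1_pass_rate[pid] < 1.0]  # keep only non-mastered problems
```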
Key Experimental Results¶
Main Results¶
| Model | Method | MATH500 | MATH Hard | AIME | GPQA | Date | Avg. |
|---|---|---|---|---|---|---|---|
| Llama-3B | GRPO | 56.92 | 30.11 | 8.33 | 27.98 | 57.10 | 35.87 |
| Llama-3B | NuRL (Self) | 58.04 | 31.62 | 9.17 | 28.28 | 61.65 | 37.49 |
| OctoThinker-3B | GRPO | 68.81 | 41.29 | 8.33 | 23.26 | 69.85 | 42.63 |
| OctoThinker-3B | NuRL (Self) | 70.13 | 42.07 | 9.66 | 27.15 | 71.75 | 44.38 |
Ablation Study¶
| Configuration | MATH500 | GPQA | Note |
|---|---|---|---|
| Hint from scratch + no trigger | 53.41 | 24.84 | Worst |
| Hint from scratch + full-failure trigger only | 56.06 | 27.63 | Trigger helps |
| Two-stage + no trigger | 53.09 | 26.62 | Two-stage also helps |
| Two-stage + full-failure trigger only (NuRL) | 58.04 | 28.28 | Best |
Key Findings¶
- NuRL improves pass@1024 (GPQA 63.6% → 69.7%, Date 86.4% → 94.0%) while GRPO leaves it unchanged, genuinely raising the capability upper bound (pass@k estimator sketched after this list).
- Teacher hints (GPT-o4-mini) yield a further +3.44% gain; self-generated hints are already effective.
- Self-consistency adds 9.4% on top of NuRL versus only 7.8% on top of GRPO, indicating that NuRL composes better with test-time scaling.
- The proportion of learnable problems increases from 66% to 70% during training, whereas standard GRPO remains at 66%.
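For context, pass@k figures such as those above are conventionally computed with the standard unbiased estimator over \(n\) sampled completions (the paper's exact evaluation script is not shown here); a minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k = 1 - C(n-c, k) / C(n, k),
    with n samples drawn and c of them correct."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 2 correct out of 1024 samples still gives high pass@512:
print(pass_at_k(1024, 2, 512))  # ~0.75
```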
Highlights & Insights¶
- Clearly and insightfully reveals the fundamental limitation of GRPO in learning from unsolvable problems.
- The analogy to Vygotsky's Zone of Proximal Development is accurate and apt, lending strong motivational grounding.
- "More abstract hints are better" is counterintuitive yet compelling—providing direct answers performs worst due to reward hacking.
- Self-generated hints require no external model, avoiding distribution shift and achieving full self-sufficiency.
- The two-stage strategy (GRPO convergence first, then NuRL) is concise and practically deployable.
Limitations & Future Work¶
- Performance gains are relatively modest (+1–2% on average), with limited improvement on stronger models (Qwen3-4B, +0.79%).
- The quality of self-generated hints is bounded by the model's own capability—extremely hard problems may not yield useful hints.
- The binary trigger (complete failure vs. partial success) for hint injection lacks a more fine-grained strategy.
- Offline hint collection requires gold answers, limiting applicability to answer-free settings.
- Hint quality evaluation and dynamic update mechanisms remain unexplored.
Related Work & Insights¶
- vs. GRPO/DAPO/Dr.GRPO: These methods improve advantage estimation, KL divergence, or sampling; NuRL is orthogonal, addressing the "unsolvable sample" problem directly.
- vs. STaR: STaR uses answer-conditioned reasoning; NuRL further abstracts this into hints that do not reveal the answer.
- vs. TBA: TBA generates diverse trajectories via multi-node search; NuRL reduces problem difficulty through hints.
Rating¶
- Novelty: ⭐⭐⭐⭐ — Insight into GRPO upper-bound limitations + self-generated hint framework
- Experimental Thoroughness: ⭐⭐⭐⭐ — 3 models, 6 benchmarks, multiple hint types, pass@k analysis
- Writing Quality: ⭐⭐⭐⭐⭐ — ZPD analogy is elegant; motivation → method → experiments flows smoothly
- Value: ⭐⭐⭐⭐ — Addresses a practical bottleneck in RL reasoning training with a concise, deployable approach