Nudging the Boundaries of LLM Reasoning¶

Conference: ICLR 2026
arXiv: 2509.25666
Code: GitHub
Area: LLM Reasoning
Keywords: Reinforcement Learning Reasoning, GRPO Improvement, Self-Generated Hints, Capability Upper Bound, Zone of Proximal Development

TL;DR¶

This paper identifies a fundamental limitation of GRPO: when the model fails every rollout for a problem (pass rate = 0%), the gradient is zero and nothing is learned from that problem. The proposed method, NuRL, injects self-generated abstract hints (which do not reveal the answer) into such hard problems during training, turning them into learnable samples. NuRL consistently outperforms GRPO across 3 models and 6 benchmarks and raises the pass@k capability upper bound.

Background & Motivation¶

Core Limitation: Online RL with GRPO produces zero gradients for problems the model cannot solve at all. When every rollout is incorrect, all rewards are identical, so the group-relative advantages are zero and the model learns nothing from the problem.
Distribution Sharpening Hypothesis: Growing evidence suggests that RL post-training primarily performs "distribution sharpening" (raising the probability of solutions the model can already reach) rather than discovering new reasoning capabilities.
Unchanged pass@k: pass@k at large \(k\) often remains unchanged after RL training, indicating that the capability upper bound has not moved.
Zone of Proximal Development: By analogy with Vygotsky's Zone of Proximal Development, hard problems lie in the "learning zone" but cannot be entered without guidance.
Value of Hard Problems: These "unsolvable" problems contain rich learning signals, exposing the model's weaknesses.
Self-Sufficiency Requirement: A method is needed to push beyond capability boundaries without relying on external, stronger models.
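
The zero-gradient limitation can be illustrated with a minimal sketch of group-relative advantages. It assumes binary 0/1 rewards and the common mean/std normalization with a small epsilon; the function name `group_advantages` and the epsilon value are illustrative and not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage for each rollout: (r_i - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails: identical rewards, so every advantage is exactly zero.
all_fail = group_advantages([0, 0, 0, 0])

# Every rollout succeeds: the same degenerate case from the other side.
all_pass = group_advantages([1, 1, 1, 1])

# One success out of four: non-zero advantages, so the policy gets a signal.
mixed = group_advantages([0, 0, 0, 1])
```

Because the policy-gradient term is weighted by these advantages, both `all_fail` and `all_pass` groups contribute nothing to the update; only mixed groups do.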

Method¶

Overall Architecture¶

NuRL combines two components: Offline Hint Collection and Online Rollout Augmentation. It is applied as the second stage of a two-stage training schedule (GRPO first, then NuRL).

Key Designs¶

Offline Hint Collection:
- Input: (question \(q\), correct answer \(a\))
- Step 1: The model generates a CoT explaining "why the answer is correct": \(y = \pi_{old}(q, a; p_y)\)
- Step 2: An abstract high-level hint (core knowledge cue) is extracted from the CoT: \(h = \pi_\theta(q, a, y; p_h)\)
- Key Constraint: Hints must be abstract and high-level, containing neither the specific answer nor explicit solution steps.
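
The two hint-collection steps above can be sketched as follows. This is a rough illustration, not the paper's implementation: the prompt templates standing in for \(p_y\) and \(p_h\), the `generate` callable, and the string-based leak check are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical templates standing in for p_y and p_h (not the paper's wording).
EXPLAIN_PROMPT = "Question: {q}\nAnswer: {a}\nExplain why this answer is correct."
HINT_PROMPT = (
    "Question: {q}\nAnswer: {a}\nExplanation: {y}\n"
    "Give one high-level hint. Do not state the answer or the solution steps."
)

def collect_hint(q: str, a: str, generate: Callable[[str], str]) -> Optional[str]:
    """Two-step offline hint collection for one (question, answer) pair."""
    # Step 1: answer-conditioned chain of thought explaining the gold answer.
    y = generate(EXPLAIN_PROMPT.format(q=q, a=a))
    # Step 2: abstract the explanation into a high-level hint.
    h = generate(HINT_PROMPT.format(q=q, a=a, y=y))
    # Crude leak check (an assumption): discard hints that contain the answer string.
    if a.strip().lower() in h.lower():
        return None
    return h

def toy_generate(prompt: str) -> str:
    """Stand-in for the model so the sketch runs without an LLM."""
    if "Give one high-level hint" in prompt:
        return "Think about how squaring relates to repeated addition."
    return "12 * 12 is 12 added twelve times, which gives 144."

hint = collect_hint("What is 12 * 12?", "144", toy_generate)
```

A real pipeline would replace `toy_generate` with calls to the policy being trained and would likely need a stronger check than substring matching to enforce the abstraction constraint.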
Online Rollout Augmentation:
- During GRPO training, \(\mathcal{G}\) rollouts are generated per problem.
- If all rollouts fail (pass rate = 0%): the hint is appended to the problem.
- \(\mathcal{G}-1\) hint-augmented rollouts are regenerated, plus 1 rollout without a hint, so that the group does not collapse to zero reward variance when every hinted rollout is correct.
- Hints are not used at inference time; their role during training is to help the model internalize reasoning patterns.
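
A minimal sketch of the trigger logic described above, assuming `sample` and `is_correct` are callables supplied by the training loop (both names are hypothetical). Reusing one of the original failed rollouts as the no-hint sample is also an assumption; the description only says that one rollout without a hint is included.

```python
from typing import Callable, List

def augmented_group(
    question: str,
    hint: str,
    sample: Callable[[str], str],
    is_correct: Callable[[str], bool],
    group_size: int,
) -> List[str]:
    """Return the rollout group used for the GRPO update on one problem."""
    rollouts = [sample(question) for _ in range(group_size)]
    # Trigger only on complete failure; otherwise keep the ordinary group.
    if any(is_correct(r) for r in rollouts):
        return rollouts
    # Append the hint to the problem and regenerate group_size - 1 rollouts.
    hinted_prompt = f"{question}\n\nHint: {hint}"
    hinted = [sample(hinted_prompt) for _ in range(group_size - 1)]
    # Keep one no-hint rollout so the group is not all-correct (assumption:
    # an original failed rollout is reused rather than freshly sampled).
    return hinted + [rollouts[0]]

def toy_sample(prompt: str) -> str:
    """Toy policy that only succeeds when a hint is present."""
    return "right" if "Hint:" in prompt else "wrong"

group = augmented_group("Q", "think about parity", toy_sample, lambda r: r == "right", 4)
```

With the toy policy, the group contains three correct hinted rollouts and one incorrect unhinted rollout, so the rewards differ and the advantages are non-zero.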
Hint Type Exploration:
- Abstract cues (best) > Partial steps > Explanations > Direct answers (worst)
- Core finding: exposing more answer information leads to worse performance—consistent with human learning principles.

Loss & Training¶

Stage 1: Standard GRPO training until training reward and validation accuracy converge.
Stage 2: Training continues with NuRL, after filtering out easy problems for which all rollouts were correct in Stage 1.
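
The Stage 2 filtering step can be sketched as below, assuming Stage 1 pass rates have been recorded per problem; the dictionary layout and the function name `stage2_problems` are illustrative.

```python
from typing import Dict, List

def stage2_problems(stage1_pass_rate: Dict[str, float]) -> List[str]:
    """Drop problems that were solved by every rollout in Stage 1."""
    return [pid for pid, rate in stage1_pass_rate.items() if rate < 1.0]

# p1 is always solved and is removed; p2 and p3 stay for NuRL training.
kept = stage2_problems({"p1": 1.0, "p2": 0.25, "p3": 0.0})
```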

Key Experimental Results¶

Main Results¶

Model	Method	MATH500	MATH Hard	AIME	GPQA	Date	Avg.
Llama-3B	GRPO	56.92	30.11	8.33	27.98	57.10	35.87
Llama-3B	NuRL(Self)	58.04	31.62	9.17	28.28	61.65	37.49
OctoThinker-3B	GRPO	68.81	41.29	8.33	23.26	69.85	42.63
OctoThinker-3B	NuRL(Self)	70.13	42.07	9.66	27.15	71.75	44.38

Note: Avg. is not the mean of the five benchmark columns shown (for example, the Llama-3B GRPO row averages 36.09 over those five), so it presumably covers all 6 benchmarks, one of which is not listed in this table.

Ablation Study¶

Configuration	MATH500	GPQA	Note
Hint from scratch + no trigger	53.41	24.84	Worst
Hint from scratch + full-failure trigger only	56.06	27.63	Trigger helps
Two-stage + no trigger	53.09	26.62	Two-stage alone helps on GPQA, not on MATH500
Two-stage + full-failure trigger only (NuRL)	58.04	28.28	Best

Key Findings¶

NuRL improves pass@1024 (GPQA 63.6%→69.7%, Date 86.4%→94.0%) while GRPO leaves it unchanged, which indicates that NuRL raises the capability upper bound rather than only sharpening the distribution.
Teacher hints (GPT-o4-mini) yield a further +3.44% gain; self-generated hints are already effective.
With Self-Consistency (SC), NuRL improves by 9.4%, compared with 7.8% for GRPO + SC, indicating stronger complementarity with SC.
The proportion of learnable problems increases from 66% to 70% during training, whereas standard GRPO remains at 66%.
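
For reference, pass@k figures such as the pass@1024 numbers above are usually computed with the standard unbiased estimator \(1 - \binom{n-c}{k} / \binom{n}{k}\), where \(n\) samples are drawn and \(c\) of them are correct. Whether the paper uses exactly this estimator is an assumption; the sketch below only shows how the metric behaves.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem the model never solves contributes 0 regardless of k.
never_solved = pass_at_k(n=10, c=0, k=5)

# One correct sample out of ten gives pass@1 of about 0.1.
rarely_solved = pass_at_k(n=10, c=1, k=1)
```

This is why problems with a 0% pass rate matter for the upper bound: they contribute zero to pass@k at every \(k\), so the metric at large \(k\) rises only if training makes some previously unsolved problems solvable.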

Highlights & Insights¶

Clearly and insightfully reveals the fundamental limitation of GRPO in learning from unsolvable problems.
The analogy to Vygotsky's Zone of Proximal Development is apt and gives the method a clear motivation.
"More abstract hints are better" is counterintuitive yet compelling: providing direct answers performs worst due to reward hacking.
Self-generated hints require no external model, avoiding distribution shift and achieving full self-sufficiency.
The two-stage strategy (GRPO convergence first, then NuRL) is concise and practically deployable.

Limitations & Future Work¶

Performance gains are modest (about +1.6 to +1.8 points in Avg. for the two models in the main results table), with less improvement on the stronger Qwen3-4B (+0.79%).
The quality of self-generated hints is bounded by the model's own capability—extremely hard problems may not yield useful hints.
Hint injection uses a binary trigger (complete failure vs. partial success); finer-grained strategies are not explored.
Offline hint collection requires gold answers, limiting applicability to answer-free settings.
Hint quality evaluation and dynamic update mechanisms remain unexplored.

Comparison with Related Work¶

vs. GRPO/DAPO/Dr.GRPO: These methods improve advantage estimation, KL divergence, or sampling; NuRL is orthogonal, addressing the "unsolvable sample" problem directly.
vs. STaR: STaR uses answer-conditioned reasoning; NuRL further abstracts this into hints that do not reveal the answer.
vs. TBA: TBA generates diverse trajectories via multi-node search; NuRL reduces problem difficulty through hints.

Rating¶

Novelty: ⭐⭐⭐⭐ — Insight into GRPO upper-bound limitations + self-generated hint framework
Experimental Thoroughness: ⭐⭐⭐⭐ — 3 models, 6 benchmarks, multiple hint types, pass@k analysis
Writing Quality: ⭐⭐⭐⭐⭐ — ZPD analogy is elegant; motivation → method → experiments flows smoothly
Value: ⭐⭐⭐⭐ — Addresses a practical bottleneck in RL reasoning training with a concise, deployable approach