Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning¶
Conference: ICLR 2026 arXiv: 2510.19807 Code: None Area: Optimization / LLM Reasoning Enhancement Keywords: GRPO, reinforcement learning, learning cliff, progressive guidance, scaffolding pedagogy
TL;DR¶
This paper proposes the Scaf-GRPO framework, which injects hierarchical in-prompt hints (Knowledge → Planning → Solution) to overcome the "learning cliff" (zero-reward) problem in GRPO training. On Qwen2.5-Math-7B, it achieves a 44.3% relative improvement in pass@1 on AIME24 while preserving on-policy training consistency.
Background & Motivation¶
Background: Reinforcement learning from verifiable rewards (RLVR) has become the dominant paradigm for enhancing LLM reasoning. Algorithms such as GRPO update policies using advantage signals derived from group-relative rewards.
Limitations of Prior Work: When a model encounters problems far beyond its current capability, all exploratory attempts fail, yielding persistent zero-reward signals. In GRPO, all-zero rewards within a group mean every \(R(o_i)\) equals the group mean \(\mu_\mathcal{G}\), so the advantage \(\hat{A}_i = \frac{R(o_i) - \mu_\mathcal{G}}{\sigma_\mathcal{G}}\) collapses to zero, the gradient vanishes, and learning stalls at a "learning cliff."
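As a minimal numeric sketch (not from the paper; the small epsilon in the denominator is an assumed implementation detail), the group-relative advantage collapses to zero whenever every rollout in a group receives the same zero reward:

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    """GRPO-style advantage: (R_i - mean) / (std + eps); eps avoids division by zero."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_relative_advantage([1.0, 0.0, 0.0, 1.0]))  # mixed outcomes -> informative signal
print(group_relative_advantage([0.0, 0.0, 0.0, 0.0]))  # all failures  -> all-zero advantages, no gradient
```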
Key Challenge: Existing solutions such as LUFFY adopt a prefix-continuation strategy—supplying the model with a correct solution prefix—which introduces a distribution mismatch between the teacher and student policies and forces the model along a predetermined path, suppressing exploration.
Goal: To help the model overcome the learning cliff and acquire reasoning capabilities from otherwise unsolvable problems, without introducing off-policy distribution mismatch.
Key Insight: Inspired by the pedagogical theory of scaffolding, the approach provides minimal, progressive in-prompt hints rather than imposing a forced solution-path prefix.
Core Idea: Rather than providing "rails" (prefixes), the method provides "signposts" (hints)—injecting hierarchical prompts so that the model generates correct solutions using its own policy, thereby avoiding off-policy issues while retaining exploratory freedom.
Method¶
Overall Architecture¶
Training proceeds in two phases. Phase 1 (guidance exemption period, first 15% of steps) allows the model to explore autonomously and distinguish "pseudo-hard" from "truly hard" problems. Phase 2 activates hierarchical hint-guided exploration for truly hard problems. When all rollouts in a problem's group yield zero reward, Scaf-GRPO injects hints in the order Knowledge → Planning → Solution until the model produces a correct solution. The successful trajectory replaces one failed trajectory, advantages are recomputed, and the policy is updated with the standard GRPO loss.
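A hypothetical sketch of this control flow is shown below; the helper names (`rollout`, `verify`, `grpo_update`) and the hint dictionary are illustrative stand-ins, not the authors' code:

```python
def scaf_grpo_step(policy, question, hints, rollout, verify, grpo_update,
                   group_size=8, truly_hard=True):
    """One training step; `hints` maps "knowledge"/"planning"/"solution" to hint text."""
    group = [rollout(policy, question) for _ in range(group_size)]    # standard GRPO rollouts
    rewards = [verify(question, o) for o in group]                    # verifiable 0/1 rewards

    # Learning cliff: every rollout failed on a problem diagnosed as truly hard.
    if truly_hard and max(rewards) == 0:
        for level in ("knowledge", "planning", "solution"):           # K -> P -> S escalation
            guided = rollout(policy, question + "\n" + hints[level])  # in-prompt hint, same policy
            if verify(question, guided) == 1:
                group[0], rewards[0] = guided, 1.0                    # replace one failed trajectory
                break                                                 # stop at the minimal effective hint

    return grpo_update(policy, group, rewards)                        # unchanged GRPO loss
```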
Key Designs¶
- Guidance Exemption Period and Truly Hard Problem Diagnosis:
- Function: No hints are provided during the initial training phase (first 15% of steps), allowing fully autonomous exploration.
- Mechanism: The resolution rate of zero-reward problems is monitored; once the rate stagnates, the remaining unsolved problems are labeled "truly hard." Problems that resolve during the rapid early decline are "pseudo-hard" (failures due to unfamiliar formatting or elementary reasoning gaps).
- Design Motivation: Prevents premature hint dependency and ensures hints are reserved for genuine capability gaps. Ablation experiments show that removing the exemption period causes a 9.2% performance drop.
- Hierarchical Hint-Guided Exploration (K→P→S):
- Function: Three levels of progressive in-prompt hints, from abstract to concrete, are injected for truly hard problems.
- Mechanism: \(H_{\text{knowledge}}\) (key concepts/formulas) → \(H_{\text{planning}}\) (high-level strategy framework) → \(H_{\text{solution}}\) (concrete computational steps). Hints are provided incrementally within each level; the process halts as soon as the model succeeds, and the minimum effective hint level is recorded.
- Design Motivation: Minimal intervention preserves model autonomy—rewarding successful problem-solving with the most abstract hint encourages internalization of reasoning skills rather than memorization. Removing any single level degrades performance (removing the Solution level causes a 5.7% drop).
- On-Policy Batch Augmentation and Unified Loss (a code sketch follows this list):
- Function: A successful hint-guided trajectory replaces one failed trajectory, restoring the advantage signal.
- Mechanism: \(\mathcal{G}_{\text{final}} = (\mathcal{G} \setminus \{o_j\}) \cup \{o_h^*\}\), where \(o_h^* \sim \pi_\theta(\cdot | q \oplus h^*)\). The probability ratio \(r_{i,t}'(\theta) = \frac{\pi_\theta(o_{i,t}'|o_{i,<t}', q \oplus h^*)}{\pi_{\theta_{\text{old}}}(o_{i,t}'|o_{i,<t}', q \oplus h^*)}\) is a standard on-policy ratio.
- Design Motivation: Unlike prefix-based methods that use the off-policy ratio \(\frac{\pi_\theta(\cdot|q)}{\pi_{\theta_{\text{old}}}(\cdot|q \oplus h^*)}\), this approach conditions both policies on the same hint-augmented prompt, guaranteeing on-policy consistency.
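The batch-augmentation step can be sketched as follows; this is a toy illustration with assumed list-based rollout storage and an epsilon-stabilized standard deviation, not the paper's implementation:

```python
import numpy as np

def augment_group(group, rewards, hint_guided_success, eps=1e-6):
    """Swap one failed trajectory for the hint-guided success, then recompute advantages."""
    j = rewards.index(0.0)                              # index of a failed trajectory o_j
    group = group[:j] + [hint_guided_success] + group[j + 1:]
    rewards = rewards[:j] + [1.0] + rewards[j + 1:]     # G_final = (G \ {o_j}) ∪ {o_h*}
    r = np.asarray(rewards)
    advantages = (r - r.mean()) / (r.std() + eps)       # no longer all zero: signal restored
    return group, advantages
```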
Loss & Training¶
The loss function is identical to standard GRPO (clipped surrogate objective), with differences confined to the data level: \(J_{\text{Scaf-GRPO}}(\theta) = \hat{\mathbb{E}}_{i,t}[\min(r_{i,t}'(\theta)\hat{A}_i', \text{clip}(r_{i,t}'(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i')]\). The KL divergence penalty is set to 0 to maximize exploration. Training runs for 10 epochs with a maximum response length of 2048 tokens.
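A hedged PyTorch sketch of that clipped surrogate is given below; tensor shapes, the padding mask, and the clip value are illustrative assumptions, and the KL term is omitted to match the zero-penalty setting described above:

```python
import torch

def grpo_clipped_loss(logp_new, logp_old, advantages, eps_clip=0.2, mask=None):
    """logp_new/logp_old: [B, T] per-token log-probs; advantages: [B] per-trajectory A_i."""
    ratio = torch.exp(logp_new - logp_old)                           # r'_{i,t}(theta)
    adv = advantages.unsqueeze(-1)                                   # broadcast A_i over tokens
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps_clip, 1 + eps_clip) * adv)
    if mask is not None:                                             # ignore padding tokens
        return -(surrogate * mask).sum() / mask.sum()
    return -surrogate.mean()                                         # negate: optimizer minimizes
```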
Key Experimental Results¶
Main Results¶
| Model / Benchmark | Metric | Scaf-GRPO | Vanilla GRPO | LUFFY | Gain |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B / AIME24 | pass@1 | 43.3 | 30.0 | 33.3 | +44.3% vs GRPO |
| Qwen2.5-Math-7B / AIME25 | pass@1 | 20.0 | 13.3 | 16.7 | +50.4% vs GRPO |
| Qwen2.5-Math-7B / AMC | pass@1 | 70.0 | 60.0 | 62.5 | +16.7% vs GRPO |
| Qwen2.5-Math-7B / Avg. 7 benchmarks | pass@1 | 50.9 | 45.2 | 46.6 | +12.6% vs GRPO |
| Qwen2.5-Math-1.5B / Average | pass@1 | 41.5 | 37.6 | — | +10.4% |
| DeepSeek-R1-Distill-1.5B / Average | pass@1 | 53.6 | 50.6 | — | +5.9% |
Ablation Study¶
| Configuration | Avg. 7 Benchmarks | Note |
|---|---|---|
| Full K→P→S | 50.9 | Complete three-level hierarchy |
| w/o Progressive (Solution-Only) | 48.4 | Directly provide most concrete hint |
| w/o Knowledge Hint | 49.2 | Remove concept level |
| w/o Solution Hint | 48.0 | Remove concrete step level; largest drop |
| w/o Incremental Chunking | 47.7 | Full hint provided at once |
| No Guidance (Vanilla GRPO) | 45.2 | Unguided baseline |
Key Findings¶
- Progressive guidance outperforms directly providing Solution hints by 2.5 points—abstract hints compel the model to reason autonomously, cultivating more generalizable skills.
- Removing any single hint level degrades performance, indicating the three levels are complementary rather than redundant.
- Incremental delivery (revealing hint content step by step) outperforms one-shot delivery by 3.2 points, validating the principle of minimal intervention.
- Models exhibit an observable evolution from "imitating hints" to "solving problems autonomously."
Highlights & Insights¶
- The "signposts vs. rails" analogy is apt: in-prompt hints allow the model to freely choose its reasoning path, whereas prefix-continuation forces a predetermined route.
- The guidance exemption period is an elegant design choice—allowing the model to struggle independently before intervening, akin to effective pedagogical practice.
- The GRPO loss function is preserved in its entirety; intervention occurs only at the data level, resulting in an engineering solution that is both clean and elegant.
Limitations & Future Work¶
- The three-level hints must be pre-generated by a strong external model (DeepSeek-R1), increasing data preparation costs.
- Validation is currently limited to mathematical reasoning tasks; transferability to domains such as code generation or logical reasoning remains unexplored.
- Hint quality has a significant impact (a 4% gap between DeepSeek-R1 and Qwen-72B), making dependence on hint generation a potential bottleneck.
- Although performance is stable when the guidance exemption period is set anywhere in the 10%–40% range (15% is used in the paper), the optimal value may vary across models.
Related Work & Insights¶
- vs. LUFFY: LUFFY's prefix-continuation approach introduces distribution mismatch requiring policy shaping corrections; Scaf-GRPO uses in-prompt hints to maintain on-policy training, outperforming LUFFY by an average of 4.3 points on 7B models.
- vs. Vanilla GRPO: GRPO learning stalls due to zero gradients under zero reward; Scaf-GRPO restores the signal through batch augmentation.
- vs. DAPO/DeepScaleR: These methods modify the GRPO algorithm itself, whereas Scaf-GRPO modifies the data and guidance strategy; the two approaches are orthogonal and can be combined.
Rating¶
- Novelty: ⭐⭐⭐⭐ The application of scaffolding pedagogy to reinforcement learning is novel; in-prompt hints as a distinct alternative to prefix-continuation represent a key contribution.
- Experimental Thoroughness: ⭐⭐⭐⭐ Covers multiple models (Qwen/Llama/DeepSeek) and scales (1.5B–7B) with carefully designed ablations.
- Writing Quality: ⭐⭐⭐⭐ Motivation is clearly articulated; the training dynamics visualization in Figure 2 is particularly intuitive.
- Value: ⭐⭐⭐⭐ Provides a practical and theoretically grounded solution to the learning cliff problem in RLVR.