Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Conference: ICLR 2026 · arXiv: 2510.04072 · Code: slow-fast-po.github.io · Area: LLM Alignment · Keywords: Reinforcement Learning, GRPO, Policy Optimization, Mathematical Reasoning, Sample Efficiency
TL;DR
This paper proposes SFPO (Slow-Fast Policy Optimization), which decomposes each training step into a three-stage structure: fast trajectory → reposition → slow correction. Without modifying the objective function or the rollout procedure, SFPO serves as a plug-and-play enhancement to GRPO, achieving up to a 2.80-point average improvement on mathematical reasoning benchmarks and up to a 4.93× reduction in rollouts.
Background & Motivation
- Reinforcement learning (RL) has become a core technique for improving LLM reasoning, with GRPO being a widely adopted critic-free policy gradient method.
- Limitations of GRPO:
- Poor rollout quality early in training produces near-random rewards and therefore high-variance gradients, causing unstable updates.
- Each batch of rollouts supports only a single update step (one-shot), underutilizing available gradient information.
- Naively reusing rollout data introduces off-policy bias, which can degrade performance in later training.
- A mechanism is needed to stabilize gradient directions, improve sample utilization, and control distributional shift simultaneously.
Method
Overall Architecture
SFPO decomposes each training iteration into three coordinated stages:
- Stage I: Fast Trajectory — Perform \(K\) inner-loop update steps on the same batch of rollouts.
- Stage II: Reposition — Interpolate back toward the starting point to control off-policy drift.
- Stage III: Slow Correction — Perform one additional gradient update at the interpolated point.
Stage I: Fast Trajectory
Starting from parameters \(\theta^{s,0}\), perform \(K\) inner-loop update steps on the same batch of rollouts: \(\theta^{s,k+1} = \theta^{s,k} - \eta\,\nabla_\theta \mathcal{L}(\theta^{s,k})\) for \(k = 0, \dots, K-1\).
- Unlike single-step updates, the displacement \(\theta^{s,K} - \theta^{s,0} = -\eta \sum_{k=0}^{K-1} \nabla_\theta \mathcal{L}(\theta^{s,k})\) accumulates the effect of \(K\) gradients.
- Under a second-order approximation, this accumulated step acts as a curvature-aware low-pass filter: it makes steady progress along flat directions while suppressing oscillations along high-curvature directions (a short derivation sketch follows below).
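To make the low-pass-filter claim concrete, here is a short derivation sketch under the quadratic-approximation assumption, with \(g\) and \(H\) denoting the gradient and Hessian at \(\theta^{s,0}\) (this notation is mine, not the paper's):

\[
\theta^{s,K} - \theta^{s,0} \;=\; -\eta \sum_{k=0}^{K-1} (I - \eta H)^k\, g \;=\; -\big(I - (I - \eta H)^K\big) H^{-1} g .
\]

Along an eigendirection of \(H\) with eigenvalue \(\lambda_i\), the accumulated step is \(-\frac{1 - (1 - \eta\lambda_i)^K}{\lambda_i}\, g_i\): roughly \(-K\eta\, g_i\) when \(\lambda_i\) is small, so flat directions keep accumulating progress, but bounded near \(-g_i / \lambda_i\) when \(\lambda_i\) is large, so high-curvature oscillations are damped rather than amplified.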
Stage II: Reposition
Inspired by the Lookahead Optimizer, the endpoint of the fast trajectory is interpolated back toward the starting point, \(\tilde{\theta}^{s} = \theta^{s,0} + \alpha\,(\theta^{s,K} - \theta^{s,0})\), where \(\alpha\) is the interpolation coefficient.
- \(\alpha\) controls the degree of off-policy drift: a smaller \(\alpha\) implies stronger proximal regularization.
- This is equivalent to solving a linearized proximal subproblem centered at \(\theta^{s,0}\), with \(\alpha\) acting as an implicit trust-region radius.
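A quick way to see the trust-region reading (my rearrangement, assuming the interpolation form given above): the repositioned point is exactly the minimizer of a linearized proximal subproblem whose linear term is the negative accumulated fast-trajectory displacement,

\[
\tilde{\theta}^{s} \;=\; \arg\min_{\theta}\; \Big\langle -\big(\theta^{s,K} - \theta^{s,0}\big),\; \theta - \theta^{s,0} \Big\rangle \;+\; \frac{1}{2\alpha}\,\big\|\theta - \theta^{s,0}\big\|_2^2 ,
\]

since setting the gradient of this objective to zero gives \(\theta = \theta^{s,0} + \alpha(\theta^{s,K} - \theta^{s,0})\). The weight \(1/\alpha\) on the quadratic penalty is what makes a smaller \(\alpha\) behave like a tighter trust region around \(\theta^{s,0}\).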
Stage III: Slow Correction
One additional gradient update is performed at the interpolated point: \(\theta^{s+1,0} = \tilde{\theta}^{s} - \eta\,\nabla_\theta \mathcal{L}(\tilde{\theta}^{s})\).
This forms a predictor-corrector structure: the fast trajectory predicts a promising direction, the reposition pulls it back into the implicit trust region, and the final gradient step, evaluated at the repositioned point, keeps the update aligned with the local curvature.
Unified Update Formula
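Composing the three stages gives one outer update per rollout batch (this compact form is assembled from the stage equations above; the paper's exact notation may differ):

\[
\theta^{s+1,0} \;=\; \tilde{\theta}^{s} - \eta\,\nabla_\theta \mathcal{L}\big(\tilde{\theta}^{s}\big),
\qquad
\tilde{\theta}^{s} \;=\; \theta^{s,0} + \alpha\big(\theta^{s,K} - \theta^{s,0}\big),
\qquad
\theta^{s,K} \;=\; \theta^{s,0} - \eta \sum_{k=0}^{K-1} \nabla_\theta \mathcal{L}(\theta^{s,k}) .
\]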
Adaptive \(\alpha\) Scheduling
- Policy entropy \(H_s\) is monitored via a rolling z-score \(Z_s = (H_s - \mu_s) / \sigma_s\), where \(\mu_s\) and \(\sigma_s\) are the mean and standard deviation over a rolling window of recent steps.
- When \(|Z_s| \geq \tau\), \(\alpha\) is set to 0, which reduces SFPO to standard GRPO.
- The fast trajectory accelerates convergence in early training, while pure on-policy updates maintain stability in later stages.
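A minimal sketch of how one SFPO step could sit on top of an existing GRPO training loop, including the entropy-gated fallback (function and variable names here are illustrative assumptions, not the authors' API; see slow-fast-po.github.io for the actual implementation):

```python
import copy
import torch

def sfpo_step(model, optimizer, batch, grpo_loss,
              K=4, alpha=0.5, entropy_zscore=0.0, tau=2.0):
    """One slow-fast update on a single batch of rollouts (illustrative sketch).

    `grpo_loss(model, batch)` is assumed to return the scalar GRPO loss
    computed on the already-generated rollouts in `batch`.
    """
    # Adaptive exit: an abnormal entropy z-score sets alpha to 0,
    # which collapses the step to a single plain GRPO update.
    if abs(entropy_zscore) >= tau:
        alpha = 0.0

    # Remember the starting point theta^{s,0}.
    start = copy.deepcopy(model.state_dict())

    # Stage I: fast trajectory -- K inner gradient steps on the same rollouts.
    # (Skipped when alpha == 0, since the reposition would undo them anyway.)
    for _ in range(K if alpha > 0 else 0):
        optimizer.zero_grad()
        grpo_loss(model, batch).backward()
        optimizer.step()

    # Stage II: reposition -- interpolate back toward the starting point:
    # theta_tilde = theta^{s,0} + alpha * (theta^{s,K} - theta^{s,0}).
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(start[name] + alpha * (p - start[name]))

    # Stage III: slow correction -- one gradient step at the repositioned point.
    optimizer.zero_grad()
    grpo_loss(model, batch).backward()
    optimizer.step()
```

With \(\alpha = 0\) the inner loop is skipped and the step reduces to a single GRPO update, matching the adaptive exit described above.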
Loss & Training
SFPO does not modify the underlying loss function and directly uses the GRPO objective:
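For reference, this is the standard GRPO objective as defined in the original GRPO work (reproduced from the GRPO literature rather than from this paper, so the exact clipping and KL configuration used by SFPO may differ):

\[
\mathcal{J}_{\text{GRPO}}(\theta) \;=\; \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\; \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,t}\Big)\right] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\text{ref}}\big],
\]

where \(r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})}\) is the token-level importance ratio over a group of \(G\) responses to the same prompt \(q\), and \(\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}\) is the group-normalized advantage. SFPO reuses this objective unchanged; only the parameter-update schedule around it changes.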
Key Experimental Results
Main Results: Mathematical Reasoning Benchmarks (DAPO+Math Training Set)
| Model | Method | Math-500 | AIME24 | AIME25 | AMC | Minerva | Olympiad | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-1.5B | GRPO | 77.15 | 16.67 | 11.67 | 53.31 | 31.89 | 39.42 | 38.35 |
| Qwen2.5-Math-1.5B | SFPO | 78.35 | 20.00 | 15.00 | 56.02 | 32.07 | 39.72 | 40.19 |
| DS-Qwen-1.5B | GRPO | 84.65 | 30.00 | 23.33 | 66.86 | 31.71 | 49.85 | 47.73 |
| DS-Qwen-1.5B | SFPO | 86.10 | 32.50 | 30.83 | 70.28 | 32.81 | 50.67 | 50.53 |
| DS-Qwen-7B | GRPO | 91.70 | 50.00 | 35.83 | 80.42 | 43.65 | 61.24 | 60.47 |
| DS-Qwen-7B | SFPO | 92.60 | 54.17 | 37.50 | 83.75 | 44.49 | 65.73 | 63.04 |
Ablation Study: Efficiency Analysis
| Model | Rollout Reduction (vs. GRPO) | Training Time Reduction (vs. GRPO) |
|---|---|---|
| DS-Qwen-1.5B | 3.21× | 2.62× |
| Qwen3-4B-Base | 3.50× | 2.65× |
| DS-Qwen-7B | 4.93× | 4.19× |
Key Findings
- SFPO consistently outperforms GRPO across all 5 models and 6 benchmarks, with the largest gains on smaller models (+2.80 on DS-Qwen-1.5B).
- Training dynamics analysis shows that SFPO avoids the response length collapse observed in GRPO.
- SFPO introduces no additional GPU memory overhead, as no extra optimizer states need to be stored.
- Consistent gains are maintained on the larger Skywork-OR1 training set (105K samples).
Highlights & Insights
- Plug-and-play design: The loss function, rollout generation, and regularization are left entirely unchanged; SFPO directly replaces the update step of GRPO.
- Clear theoretical intuition: Fast trajectory = curvature-aware low-pass filter; reposition = implicit trust region; slow correction = local curvature alignment.
- Adaptive exit mechanism: The entropy-monitored \(\alpha\) schedule automatically degrades to GRPO during convergence, balancing efficiency and stability.
- Substantial sample efficiency gains: Up to 4.93× fewer rollouts are required to reach equivalent accuracy.
Limitations & Future Work
- Several additional hyperparameters are introduced (\(K\), \(\alpha_0\), \(\omega\), \(\tau\)), though experiments suggest the method is robust to their selection.
- Theoretical analysis relies primarily on approximate derivations under the L-smooth assumption; the actual loss landscape of LLMs is considerably more complex.
- Validation is limited to mathematical reasoning tasks; the method has not been tested on other reasoning domains such as code generation or multimodal reasoning.
Related Work & Insights
- Policy gradient enhancements: DAPO, Dr.GRPO, and related methods address GRPO improvements from different angles.
- Lookahead Optimizer: The reposition mechanism in SFPO is directly inspired by this work.
- Sample efficiency: Methods such as ReMax and RLOO also target improved rollout utilization.
Rating
- Novelty: ⭐⭐⭐⭐ — The three-stage fast–reposition–slow structure constitutes a novel policy optimization paradigm.
- Technical Depth: ⭐⭐⭐⭐ — Theoretical derivations are thorough, spanning curvature analysis, proximal optimization, and adaptive scheduling.
- Experimental Thoroughness: ⭐⭐⭐⭐⭐ — Comprehensive evaluation across 5 models, 6 benchmarks, 2 training sets, with efficiency and training dynamics analysis.
- Practical Value: ⭐⭐⭐⭐⭐ — Plug-and-play integration, no additional memory cost, and significant speedup make this highly practical.