Think Outside the Policy: In-Context Steered Policy Optimization¶

Conference: ACL 2026
arXiv: 2510.26519
Code: GitHub
Area: LLM Reasoning / Reinforcement Learning
Keywords: Reinforcement Learning, In-Context Learning Steering, Policy Optimization, Exploration Enhancement, Mathematical Reasoning

TL;DR¶

This paper proposes ICPO (In-Context Steered Policy Optimization), which leverages the in-context learning (ICL) capability of large language models as implicit expert guidance to expand the policy exploration space during RLVR training, without relying on reasoning trajectories from external, stronger models.

Background & Motivation¶

State of the Field: Reinforcement learning with verifiable rewards (RLVR), particularly the GRPO algorithm, has become the dominant paradigm for enhancing mathematical reasoning in large reasoning models (LRMs). However, GRPO relies on on-policy sampling, where all trajectories are drawn from the current policy distribution, limiting exploration diversity.

Limitations of Prior Work: (1) On-policy exploration is confined to the current policy distribution, yielding insufficient trajectory diversity and susceptibility to local optima; (2) existing methods that expand the exploration space (e.g., LUFFY) depend on reasoning trajectories generated by stronger LRMs as off-policy samples, but such advanced models are computationally expensive and not always accessible; (3) directly incorporating external trajectories may introduce noise, undermining training stability.

Root Cause: RLVR requires sufficient exploration diversity to discover better policies, yet on-policy sampling inherently restricts the exploration range; while introducing external expert trajectories is effective, it creates a dependency on external resources.

Paper Goals: Design an RLVR framework that does not rely on external stronger models, leveraging the model's own capabilities to expand the exploration space and improve training effectiveness.

Starting Point: ICL is fundamentally a form of implicit expert-conditioned reasoning — by providing demonstrations in the input, the model shifts its inference distribution toward expert-aligned regions without any parameter updates. Incorporating ICL-steered trajectories into GRPO training enables a form of Implicit Expert Forcing (IEF).

Core Idea: Use demonstrations from existing datasets (e.g., the MATH training set) as ICL demonstrations to guide the model in generating off-policy trajectories, eliminating the need for external stronger models, while ensuring training stability through reject sampling and annealed reward shaping.

Method¶

Overall Architecture¶

ICPO introduces three components on top of standard GRPO: (1) Mixed-Policy GRPO with Implicit Expert Forcing (IEF), which uses ICL to generate off-policy trajectories that expand the exploration space; (2) Expert Region Reject Sampling (ERRS), which filters low-quality off-policy trajectories; and (3) Annealed Expert Bonus Reward Shaping (RS), which balances early expert guidance with late-stage autonomous optimization. For each prompt, the rollout group contains 8 trajectories: 7 on-policy and 1 ICL-steered off-policy.

Key Designs¶

Mixed-Policy GRPO with Implicit Expert Forcing (IEF):
- Function: Mixes on-policy and ICL-steered off-policy trajectories in the GRPO rollout group to expand the exploration space.
- Mechanism: For each prompt \(q\), demonstrations \(\mathcal{D}\) are randomly sampled from the expert dataset (e.g., the MATH training set) and prepended to the prompt to form \(x_{\mathrm{exp}}=[\mathcal{D};q]\). An ICL-steered trajectory \(\tau_{\mathrm{exp}} \sim \pi_\theta(\tau|x_{\mathrm{exp}})\) is then sampled from the same model. From the hypothesis-class perspective of ICL, the Transformer internally maps the demonstrations to a task vector \(\vartheta = A(\mathcal{D})\), which acts as an implicit expert prior. Group-relative advantages are then computed over the mixed rollout group.
- Design Motivation: Although all trajectories originate from the same model \(\pi_\theta\), conditioning on demonstrations changes the input and therefore the distribution the trajectory is sampled from, biasing it toward expert-aligned regions. The result is an input-conditioned off-policy signal that requires no additional model.
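The mixed rollout group described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the "Problem/Solution" demonstration template, the number of demonstrations `k`, the `eps` constant, and the use of the population standard deviation are all assumptions, and rewards are taken to be binary verifier outputs.

```python
import random
import statistics

def build_expert_prompt(demos, question, k=2, seed=0):
    """Form x_exp = [D; q]: k randomly sampled demonstrations, then the query.

    The "Problem/Solution" template is a hypothetical format for illustration.
    """
    rng = random.Random(seed)
    chosen = rng.sample(demos, k)
    shots = "\n\n".join(f"Problem: {p}\nSolution: {s}" for p, s in chosen)
    return f"{shots}\n\nProblem: {question}\nSolution:"

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (r - mean) / (std + eps) over one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# 7 on-policy rewards plus 1 ICL-steered reward, normalized as a single group.
on_policy_rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
icl_reward = [1.0]
advantages = group_relative_advantages(on_policy_rewards + icl_reward)
```

Because the ICL-steered reward enters the same mean and standard deviation as the on-policy rewards, a correct steered trajectory raises the group baseline and receives a positive advantage even when most on-policy rollouts fail.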
Expert Region Reject Sampling (ERRS):
- Function: Filters low-quality ICL-steered trajectories to prevent noise from contaminating policy updates.
- Mechanism: An expert region is defined as \(\mathcal{E}_{\mathrm{exp}} = \{(x_{\mathrm{exp}}, \tau_j) | R(\tau_j) \geq \delta\}\); only ICL-steered trajectories whose reward reaches the threshold \(\delta=1.0\) (i.e., correct answers) are included in training. The reject sampling operator \(\rho\) ensures that only these high-reward trajectories participate in policy updates.
- Design Motivation: ICL steering does not always produce correct answers; using all off-policy trajectories indiscriminately introduces misleading gradients, whereas ERRS guarantees the reliability of training signals.
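The filtering rule above amounts to a one-line predicate. The sketch below is illustrative only: the `Trajectory` container and `expert_region_filter` name are hypothetical, and what happens to a rejected slot (dropped, or refilled with an on-policy sample) is not specified in this summary, so the sketch simply drops it.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    reward: float
    off_policy: bool  # True for ICL-steered rollouts

def expert_region_filter(group, delta=1.0):
    """Keep all on-policy trajectories; keep an ICL-steered one only if R >= delta."""
    return [t for t in group if not t.off_policy or t.reward >= delta]

group = [
    Trajectory("on-policy, wrong", 0.0, False),
    Trajectory("on-policy, right", 1.0, False),
    Trajectory("ICL-steered, wrong", 0.0, True),
]
kept = expert_region_filter(group)  # the incorrect ICL-steered rollout is rejected
```

On-policy trajectories pass through regardless of reward, since incorrect on-policy samples still carry a useful negative signal; only the steered trajectory is gated.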
Annealed Expert Bonus Reward Shaping:
- Function: Strengthens expert guidance in early training and gradually relaxes it in later stages to promote autonomous optimization.
- Mechanism: A linearly decaying bonus is added to the reward of correct trajectories within the expert region, \(R_{\mathrm{shaped}}(\tau) = R(\tau) + \alpha \cdot \gamma(t)\), where \(\alpha\) scales the bonus and \(\gamma(t) = 1 - t/T\) is a linear annealing schedule over training step \(t\) out of \(T\) total steps. The bonus is largest at the start of training, encouraging imitation of expert behavior, and vanishes by the end, leaving the policy to explore autonomously.
- Design Motivation: A fixed expert reward bonus may cause over-reliance on expert behavior; the annealing design achieves a smooth transition from "following the expert" to "autonomous reasoning."
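A small sketch of the annealed bonus, assuming `t` counts optimizer steps and `T` is the total step budget. The value `alpha=0.5` is a placeholder rather than a setting reported in this summary, and the clamp to `[0, 1]` is an added safeguard for steps outside the schedule.

```python
def annealing_weight(step, total_steps):
    """Linear schedule gamma(t) = 1 - t/T, clamped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - step / total_steps))

def shaped_reward(reward, step, total_steps, alpha=0.5, in_expert_region=False):
    """R_shaped = R + alpha * gamma(t) for expert-region trajectories, else R."""
    if not in_expert_region:
        return reward
    return reward + alpha * annealing_weight(step, total_steps)

# Shaped reward of a correct expert-region trajectory at the start,
# midpoint, and end of a 100-step run.
schedule = [shaped_reward(1.0, t, 100, in_expert_region=True) for t in (0, 50, 100)]
```

With these placeholder values the shaped reward falls from 1.5 to 1.25 to 1.0, so by the final step an expert-region trajectory is rewarded exactly like any other correct trajectory.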

Loss & Training¶

The final objective \(\mathcal{J}_{\mathrm{ICPO}}(\theta)\) comprises both on-policy and off-policy terms, with the off-policy component adjusted via reject sampling and importance ratios. A regularized importance sampling function \(f(x) = x/(x+\lambda)\) (\(\lambda=0.01\)) is applied to off-policy trajectories for policy shaping. A KL regularization term is also retained to prevent excessive policy drift.
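To see what the shaping function does numerically, the sketch below evaluates \(f(x) = x/(x+\lambda)\) and its derivative \(f'(x) = \lambda/(x+\lambda)^2\). Here `x` stands for whatever ratio or probability the function is applied to; the summary does not pin this down further, so treat the interpretation as an assumption.

```python
def shaped_ratio(x, lam=0.01):
    """Regularized importance weight f(x) = x / (x + lam)."""
    return x / (x + lam)

def shaped_ratio_grad(x, lam=0.01):
    """Derivative f'(x) = lam / (x + lam)**2, largest when x is small."""
    return lam / (x + lam) ** 2

# Tabulate the weight and its slope across several orders of magnitude.
for x in (0.001, 0.01, 0.1, 1.0):
    print(f"x={x:<6} f(x)={shaped_ratio(x):.4f} f'(x)={shaped_ratio_grad(x):.4f}")
```

The function saturates toward 1 once `x` is much larger than `lam`, while its slope is steepest for small `x`. Under the assumption that `x` is the probability the current policy assigns to an off-policy token, this concentrates gradient on tokens the policy currently considers unlikely.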

Key Experimental Results¶

Main Results¶

Model	Method	AIME24/25	MATH-500	Olympiad	Avg.	Avg. Gain
Qwen3-1.7B	GRPO	28.4/22.5	83.6	48.2	48.4	-
Qwen3-1.7B	ICPO	31.3/26.3	86.8	56.4	52.5	+4.1
Qwen3-8B	GRPO	54.8/38.5	91.0	62.4	63.5	-
Qwen3-8B	ICPO	55.2/43.7	92.0	65.2	65.7	+2.2
Qwen2.5-Math-7B	LUFFY	-	87.6	57.2	50.1	-
Qwen2.5-Math-7B	ICPO†	-	86.6	53.6	53.4	+3.3 vs LUFFY
† denotes the ICPO variant with reward shaping.

Ablation Study¶

Configuration	Avg. (1.7B)	Avg. (8B)	Note
ICPO (full)	51.8	65.8	Full model
− ERRS	50.6	65.0	Removing reject sampling degrades performance
− IEF (= GRPO)	48.4	63.8	Removing ICL steering reverts to standard GRPO
CoT vs. PoT expert data	51.8 vs. 51.5	65.8 vs. 65.1	Robust to expert data type

Key Findings¶

- ICL-steered trajectories not only improve accuracy but also increase trajectory diversity (larger edit distances) and improve distribution quality (a higher "flip" ratio, i.e., more answers changing from incorrect to correct).
- ICPO maintains higher policy entropy throughout training, reflecting broader policy support and more thorough exploration.
- ICPO is robust to the choice of expert data: consistent gains are achieved even with cross-domain data in program-of-thought (PoT) format.
- The ICPO† variant with reward shaping performs better on OOD benchmarks, suggesting that the annealing strategy aids generalization.
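The three diagnostics mentioned in these findings can be computed as in the sketch below. The exact definitions used in the paper are not given in this summary, so these are assumed forms: diversity as the mean pairwise Levenshtein distance within a rollout group, the flip ratio as the share of initially incorrect prompts that become correct under ICL steering, and entropy as the Shannon entropy of a next-token distribution.

```python
import math
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def mean_pairwise_edit_distance(trajectories):
    """Average edit distance over all unordered pairs in a rollout group."""
    pairs = list(combinations(trajectories, 2))
    if not pairs:
        return 0.0
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)

def flip_ratio(base_correct, steered_correct):
    """Share of prompts answered wrongly without steering but correctly with it."""
    flipped = [s for b, s in zip(base_correct, steered_correct) if not b]
    return sum(flipped) / len(flipped) if flipped else 0.0

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

In practice the trajectories would be token-ID lists rather than strings; `edit_distance` accepts either because it only compares elements for equality.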

Highlights & Insights¶

- Theoretical Perspective of ICL as Implicit Expert Forcing: The paper establishes an elegant theoretical link between the hypothesis-class decomposition of ICL and expert forcing.
- Zero Dependency on External Models: In contrast to methods such as LUFFY, ICPO requires no external stronger models; existing datasets alone serve as ICL demonstrations.
- Plug-and-Play Framework Design: The target policy distribution can be flexibly adjusted by swapping expert data, offering strong extensibility.
- Visualization of Training Dynamics: Reward and entropy curves clearly demonstrate the advantages of ICPO over GRPO.

Limitations & Future Work¶

- Experiments are primarily conducted on mathematical reasoning; cross-domain generalization (e.g., code generation, commonsense reasoning) remains insufficiently validated.
- The quality of ICL steering depends on the quality of demonstrations, and may be limited for extremely difficult problems.
- Only 1 off-policy trajectory per prompt is used (7+1 configuration); more flexible ratio strategies warrant exploration.
- Future work could combine ICPO with other exploration-enhancement techniques, such as temperature scheduling and replay buffers.

Related Work & Insights¶

- vs. LUFFY: LUFFY requires reasoning trajectories from stronger LRMs as off-policy samples; ICPO replaces the external model with ICL and surpasses LUFFY by +3.3 average points on Qwen2.5-Math-7B.
- vs. GRPO + Extra Rollouts: Simply increasing the number of rollouts yields limited gains; the ICL steering in ICPO provides a more effective exploration signal.
- vs. ReLIFT: ReLIFT alternates between RL and SFT, introducing training instability; ICPO unifies SFT signals and RL optimization within a single framework.

Rating¶

- Novelty: ⭐⭐⭐⭐. Reinterpreting ICL as implicit expert forcing and integrating it into the RLVR framework is a genuinely novel idea.
- Experimental Thoroughness: ⭐⭐⭐⭐. Comprehensive evaluation across multiple models and benchmarks, including ablations, expert data type analysis, and training dynamics.
- Writing Quality: ⭐⭐⭐⭐. Clear framework presentation with complete derivations; theoretical motivation and empirical validation are well integrated.
- Value: ⭐⭐⭐⭐. Provides a low-cost paradigm for exploration enhancement in RLVR, with practical utility for LRM post-training.