# Provable and Practical In-Context Policy Optimization for Self-Improvement
**Conference:** ICLR 2026 · **arXiv:** 2603.01335 · **Code:** https://github.com/UNCSciML/ICPO · **Area:** Optimization · **Keywords:** in-context learning, policy optimization, test-time scaling, self-reflection, mathematical reasoning
## TL;DR
This paper proposes the In-Context Policy Optimization (ICPO) framework, theoretically proving that a single-layer linear self-attention Transformer, after sufficient pretraining, can simulate a policy optimization algorithm in context. Building on this, the paper designs a practical ME-ICPO algorithm that achieves multi-round test-time self-reflection via minimum-entropy selection and self-evaluation rewards, yielding significant gains on mathematical reasoning tasks (Qwen2.5-Math-7B improves from 11% to 30% on AIME 2024).
## Background & Motivation
Background: Test-time scaling has become an important paradigm for improving LLM reasoning—models iteratively refine their answers through multi-round self-reflection without parameter updates. Representative methods include Chain-of-Thought, Tree-of-Thoughts, Best-of-N, and Self-Refine.
Limitations of Prior Work: (a) Why does self-reflection emerge from pretraining? Prior work (e.g., Park et al. 2024) directly assumes LLMs possess posterior sampling/policy optimization capabilities without explaining their origin. (b) Theoretical analyses of in-context learning have focused on supervised learning (linear regression) and value function learning (TD learning), with no theory addressing policy optimization. (c) Existing methods such as Tree-of-Thoughts require multi-step search, incurring substantial computational overhead.
Key Challenge: How can a model leverage historical attempts and reward feedback in context to optimize its own output policy? Can Transformers theoretically realize such policy optimization without parameter updates?
Goal: (1) Provide a theoretical foundation for LLM self-reflection and self-improvement behavior. (2) Design a practical test-time scaling algorithm.
Key Insight: Self-reflection is formalized as policy optimization in a K-armed bandit problem—the agent generates a response (action), receives a reward, and accumulates history \(\{(\mathbf{x}_1, r_1), \ldots, (\mathbf{x}_t, r_t)\}\) in context to optimize subsequent behavior.
Core Idea: The self-attention mechanism of Transformers has a natural inductive bias for simulating policy optimization based on FTRL (Follow-The-Regularized-Leader), and after sufficient pretraining it can perform such policy optimization in context.
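To make the FTRL connection concrete, the standard entropy-regularized instance over a \(K\)-armed bandit is the exponential-weights update. This is a textbook form included for orientation; the paper's exact regularizer and reward estimator may differ:

\[
\mathbf{p}_{t+1} = \operatorname*{arg\,max}_{\mathbf{p} \in \Delta_K} \Big\{ \eta_t \sum_{s \le t} \langle \mathbf{p}, \hat{\mathbf{r}}_s \rangle + H(\mathbf{p}) \Big\}
\quad\Longrightarrow\quad
p_{t+1,i} \propto \exp\Big( \eta_t \sum_{s \le t} \hat{r}_{s,i} \Big),
\]

where \(H\) is the Shannon entropy and \(\hat{\mathbf{r}}_s\) is the reward estimate from round \(s\). The structural point: the policy's logits are a running sum over the context history, exactly the kind of statistic an attention layer can accumulate.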
## Method

### Overall Architecture
The ICPO framework: given a problem → the model generates response \(\mathbf{x}_t\) → receives reward \(r_t\) (self-evaluated or external) → appends \((\mathbf{x}_t, r_t)\) to the context history → the model generates an improved response \(\mathbf{x}_{t+1}\) based on the updated history → repeat.
Theoretical analysis is conducted on linear self-attention (LSA) Transformers, proving that they can exactly simulate an FTRL-based policy optimization algorithm.
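As a toy illustration of why a single linear-attention layer is structurally sufficient (our construction with full-information rewards, not the paper's exact LSA parameterization or its bandit feedback model): a policy whose logits are a running reward sum over the history implements exponential weights and concentrates on the best arm.

```python
import numpy as np

# Toy construction (not the paper's LSA weights): with full-information
# rewards, logits that accumulate a running reward sum -- precisely the
# statistic a single linear-attention layer can compute over the
# (x_s, r_s) history -- implement exponential weights.
rng = np.random.default_rng(0)
K, T, eta = 3, 200, 0.3
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm reward rates

logits = np.zeros(K)
for t in range(T):
    r = rng.binomial(1, true_means)      # reward vector observed this round
    logits += eta * r                    # attention-style additive update

p = np.exp(logits - logits.max())
p /= p.sum()
print(np.round(p, 4))                    # mass concentrates on the best arm
```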
### Key Designs
- **Fisher-weighted logit-matching pretraining objective** (see the loss sketch after this list):
  - Function: A novel supervised pretraining loss that enables the Transformer to perform in-context policy optimization.
  - Mechanism: The loss is \(\mathcal{L}(\theta) = \frac{1}{2} \mathbb{E}_{\tau \in \mathcal{D}} [\sum_t \| \text{Proj}(\hat{\mathbf{s}}_{\tau,t+1} - \mathbf{s}_{\tau,t+1}^{\text{PO}}) \|_{\Gamma}^2]\), where \(\Gamma\) is the Fisher information matrix of the policy and Proj projects out the constant bias (which does not affect the softmax policy).
  - Design Motivation: Fisher weighting makes the loss proportional to the KL divergence (Theorem 4.1), explaining why a standard KL loss suffices to enable Transformers to learn self-reflection.
- **Population equivalence and finite-sample guarantees**:
  - Function: Proves that after sufficient pretraining, a single-layer LSA can exactly simulate the target policy optimization algorithm.
  - Mechanism: Theorem 4.2 shows that the optimal parameters \(\theta^*\) enable the LSA to exactly reproduce policy optimization behavior over all possible histories; Theorem 4.3 provides a sample complexity of \(\tilde{O}(N^2 K / c_\lambda^2)\).
  - Design Motivation: Provides theoretical evidence for LLMs' in-context policy optimization capability; a single attention layer suffices.
- **Robustness guarantee (reward shock stability)** (see the simulation sketch after this list):
  - Function: Analyzes the stability of the ICPO loop against a single reward perturbation.
  - Mechanism: Theorem 4.8 proves that when the learning rate \(\eta_t = c/t\) is sufficiently small, the effect of a one-time reward perturbation \(\delta_r\) injected at round \(s\) decays to zero over time: \(\mathbb{E}[\|\Delta \hat{\mathbf{p}}_{t+1}^s\|_2] \leq \frac{a(1+C_b)}{s} \left(\frac{t}{s}\right)^{b-1} |\delta_r|\).
  - Design Motivation: Provides theoretical support for using noisy self-evaluation rewards.
- **ME-ICPO practical algorithm** (see the inference sketch in the next section):
  - Function: Translates the theoretical framework into a practical test-time inference algorithm.
  - Mechanism: (1) Sample \(k\) candidate responses per round; (2) use majority vote to determine the self-evaluation reward \(r_j^{(t)} = \mathbb{1}[a_j^{(t)} = \hat{a}_t]\); (3) summarize and compress the CoT; (4) minimum-entropy selection: add to the context the candidate that minimizes the entropy of subsequent responses.
  - Design Motivation: Minimum-entropy selection follows the "pessimism" principle of offline RL, choosing the direction the agent is most confident about rather than being misled by noisy rewards.
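A minimal PyTorch sketch of the Fisher-weighted logit-matching loss as we read it from the formula above; implementing Proj as mean-removal and taking \(\Gamma\) at the target policy are our assumptions:

```python
import torch

def fisher_weighted_logit_loss(s_hat: torch.Tensor, s_po: torch.Tensor) -> torch.Tensor:
    """Sketch of the Fisher-weighted logit-matching loss.

    s_hat: (B, K) logits predicted by the Transformer.
    s_po:  (B, K) target logits from the reference policy-optimization step.
    """
    p = torch.softmax(s_po, dim=-1)        # assumption: Fisher taken at the target policy
    d = s_hat - s_po                       # logit residual
    d = d - d.mean(dim=-1, keepdim=True)   # Proj as mean-removal (assumed form); a constant
                                           # shift never changes the softmax policy
    # ||d||_Gamma^2 with Gamma = diag(p) - p p^T, the Fisher matrix of a softmax policy
    quad = (p * d.pow(2)).sum(-1) - (p * d).sum(-1).pow(2)
    return 0.5 * quad.mean()
```

To second order in the residual, \(\frac{1}{2}\mathbf{d}^\top \Gamma \mathbf{d} \approx \mathrm{KL}\big(\mathrm{softmax}(\mathbf{s}^{\text{PO}}) \,\big\|\, \mathrm{softmax}(\hat{\mathbf{s}})\big)\), which is the sense in which Fisher weighting recovers a KL loss (Theorem 4.1).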
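And a small simulation of the reward-shock claim, using a simplified FTRL dynamic \(\mathbf{p}_{t+1} = \mathrm{softmax}(\eta_t \sum_{s' \le t} \mathbf{r}_{s'})\) with \(\eta_t = c/t\) rather than the paper's exact ICPO recursion; the gap between a clean run and a run shocked once at round \(s_0\) shrinks roughly like \(1/t\):

```python
import numpy as np

# Toy check of reward-shock stability (simplified dynamic, not the paper's
# exact recursion): two runs share the same reward stream, but run B gets
# a one-time perturbation delta_r on arm 0 at round s0.
rng = np.random.default_rng(1)
K, T, c, s0, delta_r = 3, 2000, 1.0, 10, 5.0
rewards = rng.uniform(size=(T, K))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

cum = np.zeros(K)
shock = np.zeros(K)
for t in range(1, T + 1):
    cum += rewards[t - 1]
    if t == s0:
        shock[0] = delta_r                  # one-time reward perturbation
    eta = c / t                             # decaying learning rate eta_t = c/t
    gap = np.linalg.norm(softmax(eta * cum) - softmax(eta * (cum + shock)))
    if t in (10, 100, 1000, 2000):
        print(t, round(gap, 5))             # gap decays ~1/t after the shock
```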
## Loss & Training
ME-ICPO involves no parameter updates at test time; it is a pure inference-time algorithm. Core strategy:

- Sample \(k=16\) responses per round
- Use the majority vote as the reward estimate
- Summarize the CoT to compress context length
- Apply minimum-entropy selection for robustness
- Output the final response after \(n\) iterations
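A minimal sketch of the ME-ICPO inference loop as described above; `model.generate`, `model.entropy`, and `summarize` are hypothetical interfaces standing in for the actual LLM calls:

```python
from collections import Counter

def me_icpo(model, problem, rounds=4, k=16):
    """Sketch of ME-ICPO test-time inference; no parameter updates anywhere."""
    history = []          # in-context list of (summarized CoT, reward) pairs
    final_answer = None
    for _ in range(rounds):
        # (1) Sample k candidate responses conditioned on the current history.
        candidates = [model.generate(problem, history) for _ in range(k)]
        answers = [c.answer for c in candidates]        # hypothetical fields
        # (2) Majority vote defines the self-evaluation reward r_j = 1[a_j == a_hat].
        a_hat, _ = Counter(answers).most_common(1)[0]
        rewards = [int(a == a_hat) for a in answers]
        # (3) Summarize each CoT so the accumulated context stays short.
        summaries = [summarize(c.cot) for c in candidates]
        # (4) Minimum-entropy selection: keep the candidate whose inclusion
        #     makes the model's subsequent responses least entropic.
        scores = [model.entropy(problem, history + [(s, r)])
                  for s, r in zip(summaries, rewards)]
        best = min(range(k), key=lambda j: scores[j])
        history.append((summaries[best], rewards[best]))
        final_answer = a_hat
    return final_answer
```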
## Key Experimental Results

### Main Results
| Model | Benchmark | Base Mean@16 | w/ ME-ICPO Mean@16 | Gain |
|---|---|---|---|---|
| Qwen2.5-Math-7B | AIME 2024 | 11.04 | 30.42 | +19.38 |
| Qwen2.5-Math-7B | AMC | 41.42 | 47.06 | +5.64 |
| Qwen2.5-Math-7B | MATH-L5 | 30.58 | 38.71 | +8.13 |
| Qwen2.5-Math-1.5B | AIME 2024 | 6.46 | 9.79 | +3.33 |
| Qwen2.5-Math-1.5B | MATH-L1 | 49.27 | 57.06 | +7.79 |
Gains are most pronounced on AIME 2024: +19.38 for the 7B model and +3.33 for the 1.5B model. ME-ICPO's Mean@16 can exceed the baseline model's Maj@k upper bound.
### Ablation Study
| Configuration | AIME 2024 Accuracy (%) |
|---|---|
| w/o Reward | 19.30 |
| w/o Entropy | 5.77 |
| w/o Entropy & Reward | 6.21 |
| ME-ICPO (full) | 30.05 |
| ME-ICPO Oracle | 38.19 |
### Key Findings
- Minimum-entropy selection is the most critical component: removing it causes accuracy to plummet from 30.05% to 5.77%, worse even than removing both entropy selection and reward (6.21%), indicating that without a principled selection strategy, randomly accumulated context is actively harmful.
- Reward signal also matters: removing it reduces accuracy from 30.05% to 19.30%.
- Theoretical validation experiments: the policy-matching error of LSA converges rapidly to numerical precision, and the effect of a single reward shock indeed decays over time.
- ME-ICPO's Mean@16 can surpass the baseline's Maj@k upper bound—indicating that in-context policy optimization learns information beyond simple voting.
## Highlights & Insights
- Closed-loop theory-to-practice: Starting from a theoretical analysis of linear self-attention, the paper derives practical algorithm design principles (reward guidance + minimum-entropy selection) and validates them on real LLMs—a complete research loop.
- Insight behind minimum-entropy selection: Rather than selecting the highest-reward candidate, the algorithm selects the candidate toward which the model is most confident. This is especially important when self-evaluation rewards are noisy—a high reward may be coincidental, but low entropy reflects a stable consensus across the model's outputs.
- Single-layer sufficiency: Unlike Lin et al. (2023), which requires \(O(\sqrt{T})\) layers, ICPO requires only a single-layer LSA and does not need more layers as context length grows—a result more aligned with the long-context setting of real LLMs.
## Limitations & Future Work
- The theoretical analysis is based on linear self-attention and linear bandit assumptions, which differ substantially from real LLMs and mathematical reasoning problems.
- ME-ICPO samples 16 candidate responses per round, and the computational overhead of multi-round iteration remains considerable.
- Self-evaluation rewards are based on Majority Vote, which may produce incorrect signals when the model makes systematic errors.
- Experiments are limited to mathematical reasoning tasks; code generation, logical reasoning, and other domains have not been evaluated.
- CoT summarization may discard critical reasoning step information.
## Related Work & Insights
- vs. Self-Refine/Reflexion: These methods perform self-reflection via natural language feedback but lack a theoretical explanation for their effectiveness; ICPO provides a theoretical foundation from a policy optimization perspective.
- vs. Tree-of-Thoughts: ToT searches at every step, whereas ME-ICPO optimizes the entire CoT per round—coarser-grained but more computationally efficient.
- vs. TTRL: TTRL performs gradient updates at test time; ME-ICPO is purely in-context with no parameter updates—substantially more lightweight.
- vs. Best-of-N: BoN selects the single best response in one shot; ME-ICPO instead accumulates contextual information over multiple rounds and improves progressively, which the paper argues is theoretically superior.
## Rating
- Novelty: ⭐⭐⭐⭐⭐ First theoretical analysis of LLM self-reflection from a policy optimization perspective; minimum-entropy selection is a novel design.
- Experimental Thoroughness: ⭐⭐⭐⭐ Theoretical validation is thorough, but LLM experiments are limited to mathematical tasks and the Qwen model family.
- Writing Quality: ⭐⭐⭐⭐ Theoretical derivations are rigorous, though the transition from theory to practice could be tighter.
- Value: ⭐⭐⭐⭐ Provides a theoretical foundation for test-time scaling; the minimum-entropy selection strategy has practical utility.