Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Conference: ICML 2026
arXiv: 2511.04393
Code: Not yet released
Area: LLM Agent / Online Decision-Making / Post-Training
Keywords: Regret Minimization, Iterative SFT, Online Decision-Making, Multi-Armed Bandit, Self-Generated Reasoning
TL;DR
The authors propose Iterative RMFT: the LLM rolls out its own decision trajectories, the trajectories are ranked by regret, and the \(k\) lowest-regret trajectories are used to fine-tune the model via SFT, with the procedure repeated over several iterations. Without relying on known optimal algorithms (e.g., UCB/FTRL) or manually designed CoT templates, this approach lets no-regret behavior and a sound exploration-exploitation balance emerge across three types of verbalized decision tasks: full-information online learning (FOL), multi-armed bandits (MAB), and non-stationary bandits (NS-MAB).
Background & Motivation
Background: Deploying LLMs as decision-making agents in multi-round interactive environments (recommendation, gaming, healthcare, operations) has become a prominent trend. However, LLMs are pre-trained for next-token prediction and are not explicitly optimized for online decision-making, so there is no theoretical guarantee that they will make effective decisions.
Limitations of Prior Work: Empirical studies show that out-of-the-box LLMs fail even at basic online decision problems: they are reluctant to explore in stochastic MABs, exhibit linear regret growth in adversarial online learning, and fail to track reward drifts in non-stationary environments. In other words, vanilla LLMs are not inherently no-regret learners on canonical "textbook" tasks.
Key Challenge: Existing post-training methods have tried to fix this through two main routes. The first is "Algorithm Distillation": distilling action sequences of known optimal algorithms (e.g., UCB, EXP3) into the LLM. This requires prior knowledge of the environment's optimal algorithm, and the resulting models are sensitive to problem structure such as action space size, time horizon, and reward distribution, often failing when transferred to new verbalized tasks. The second is RL fine-tuning, but using raw rewards as the signal primarily targets reward maximization; it does not naturally include exploration incentives and cannot be directly applied to adversarial or non-stationary settings.
Goal: To find a unified post-training paradigm that improves LLM decision-making capabilities across verbalized tasks without relying on known optimal algorithms, while simultaneously preserving and enhancing the LLM's CoT reasoning process.
Key Insight: Regret is a universal metric in online decision-making (FOL, MAB, and NS-MAB can all be characterized by regret or dynamic regret), and it can be calculated ex-post once a decision trajectory is obtained. Since regret can be computed after the LLM has rolled out its own trajectories, it can serve as an "ex-post judge" that decides which self-generated trajectories are worth using for SFT.
Core Idea: Use regret as the sole trajectory-filtering signal to iteratively distill low-regret trajectories generated by the LLM back into itself (self-imitation). This allows no-regret behavior to "emerge" rather than being forced into the model.
Method
Overall Architecture
The method is a meta-algorithm applicable to three online decision environments: FOL, MAB, and NS-MAB. In each outer iteration, the LLM rolls out \(L\) trajectories for each of \(M\) different scenarios (verbally described decision tasks). Each trajectory consists of several (Reasoning CoT, Action) pairs, with all interaction taking place in natural language. Cumulative regret is then calculated for each trajectory (static regret for FOL/MAB, dynamic regret for NS-MAB). The \(k\) trajectories with the lowest regret from each scenario form the SFT dataset \(\mathcal{D}\), which is used to update the model with the standard SFT loss. The new model replaces the old one for the next iteration, and the loop repeats until convergence.
The key to this process is that the training signal is strictly the regret scalar, with no additional assumptions regarding action formats, CoT templates, or optimal algorithms; the model's inherent CoT is preserved and reinforced via SFT.
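
The outer loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Trajectory`, `select_lowest_regret`, `iterative_rmft`, and the `rollout` / `finetune` callables are hypothetical names, and the stopping rule is simplified to a fixed number of iterations rather than a convergence check.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Trajectory:
    scenario_id: int
    text: str      # full verbalized dialogue: reasoning + actions
    regret: float  # cumulative regret, computed after the rollout finishes


def select_lowest_regret(trajs: Sequence[Trajectory], k: int) -> List[Trajectory]:
    """Keep the k trajectories with the lowest regret for one scenario."""
    return sorted(trajs, key=lambda t: t.regret)[:k]


def iterative_rmft(
    model: Any,
    scenarios: Sequence[Any],
    rollout: Callable[[Any, int, Any], Trajectory],   # hypothetical: runs one episode
    finetune: Callable[[Any, List[Trajectory]], Any],  # hypothetical: one SFT pass
    num_iters: int,
    L: int,
    k: int,
) -> Any:
    """Rollout -> rank by regret -> keep top-k per scenario -> SFT -> repeat."""
    for _ in range(num_iters):
        dataset: List[Trajectory] = []
        for sid, scenario in enumerate(scenarios):
            # L self-generated trajectories for this scenario.
            trajs = [rollout(model, sid, scenario) for _ in range(L)]
            # Regret is the only signal used to decide what enters the SFT set.
            dataset.extend(select_lowest_regret(trajs, k))
        # The fine-tuned model replaces the old one for the next iteration.
        model = finetune(model, dataset)
    return model
```

With \(M\) scenarios, each iteration produces an SFT dataset of \(M \cdot k\) trajectories, so the ratio \(k / L\) controls how selective the filter is.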
Key Designs
- Trajectory Selection by Regret:
    - Function: Unifies "online decision evaluation" and "SFT data construction" by using regret as the sole scalar that decides which self-generated trajectories enter the next training set.
    - Mechanism: For each scenario \(i\), \(L\) trajectories \(C_{1,i}, \dots, C_{L,i}\) are rolled out. Regret is calculated once each trajectory is complete: for example, in FOL, \(\text{Regret}_\mathcal{A}((R_t)_{t\in[T]}, T) = \max_{\pi\in\Pi} \sum_{t=1}^{T} \langle \pi, R_t\rangle - \sum_{t=1}^{T} \langle \pi_{\mathcal{A}, t}, R_t\rangle\); in MAB, expected regret is used; in NS-MAB, dynamic regret \(\text{D-Regret} = \mathbb{E}\left[\sum_{t=1}^{T} \max_a r_t(a) - \sum_{t=1}^{T} r_t(a_{\mathcal{A},t})\right]\) is used. The \(k\) trajectories with the lowest regret are selected for \(\mathcal{D}\).
    - Design Motivation: Regret is a universal metric for online decision-making that does not depend on knowing the optimal algorithm, action space size, or time horizon, so it naturally supports training across different tasks and problem structures. Since filtering occurs at the trajectory level rather than the token level, the entire CoT is preserved, avoiding the credit-assignment problem common in RL.
- Imitation on Self-Generated Reasoning:
    - Function: Updates the model using SFT instead of RL, making all natural language components (reasoning + actions) in the trajectory part of the supervision target.
    - Mechanism: Selected trajectories are formatted as complete dialogues (task description + history + reasoning + action) and used as SFT samples, with standard cross-entropy loss on every token. No reward model is introduced, no token-level RL is performed, and no fixed template is enforced for the action format or CoT structure.
    - Design Motivation: The authors contrast this with algorithm distillation and RLFT. Algorithm distillation requires fixed output formats and relies on optimal algorithms; RLFT using reward signals cannot characterize regret in adversarial/non-stationary settings. SFT on self-generated trajectories can use closed-source APIs (such as GPT-4o mini's fine-tuning interface), does not constrain the form of the CoT, and allows the model to discover new "algorithmic-style" reasoning, leading to stronger generalization.
- Meta-algorithm Instantiation (FOL / MAB / NS-MAB):
    - Function: Covers three typical online decision environments with the same outer loop to verify the universality of the regret signal.
    - Mechanism: FOL uses full-information reward vectors \(R_t\) to evaluate each round's action; MAB uses partial feedback \(R_t(a_t)\) and takes expectations over randomness; NS-MAB introduces a variation budget \(V_T = \sum_{t=2}^T \|r_t - r_{t-1}\|_\infty\) and uses dynamic regret as the selection criterion. The scenario library consists of verbalized tasks (medical recommendation, resource allocation, marketing, etc.). Each scenario is rendered as a natural-language dialogue, round by round; the agent outputs its CoT together with an action, and the program parses \(a_t\) or \(\pi_t \in \Delta(\mathcal{A})\) from the output.
    - Design Motivation: The ability of a single signal to cover three typical environments is the strongest empirical evidence of regret's universality. By randomizing across the scenario dimension (varying horizon, action space, reward generation, and domain context), the trained model maintains no-regret behavior on unseen scenarios rather than just memorizing lookup tables for specific horizons.
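
The three regret notions used as selection criteria can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the function names are hypothetical, \(\Pi\) is assumed to be the full simplex (so the best fixed policy in hindsight is a single action), the MAB case uses pseudo-regret with known arm means as one common form of expected regret, and the dynamic-regret function evaluates a single realized action sequence rather than taking the outer expectation.

```python
from typing import List, Sequence


def fol_regret(policies: Sequence[Sequence[float]],
               rewards: Sequence[Sequence[float]]) -> float:
    """Static regret in FOL: best fixed action in hindsight minus earned reward.

    policies[t] is the learner's distribution pi_t over actions,
    rewards[t] is the full reward vector R_t revealed after round t.
    """
    n_actions = len(rewards[0])
    # A linear objective over the simplex is maximized at a vertex (one action).
    best_fixed = max(sum(r[a] for r in rewards) for a in range(n_actions))
    earned = sum(sum(p * x for p, x in zip(pi, r))
                 for pi, r in zip(policies, rewards))
    return best_fixed - earned


def mab_pseudo_regret(actions: Sequence[int], means: Sequence[float]) -> float:
    """Pseudo-regret in a stochastic MAB, assuming the arm means are known."""
    best = max(means)
    return sum(best - means[a] for a in actions)


def dynamic_regret(actions: Sequence[int],
                   mean_rewards: Sequence[Sequence[float]]) -> float:
    """Dynamic regret: compare against the best arm of each round separately.

    mean_rewards[t][a] plays the role of r_t(a).
    """
    return sum(max(r) - r[a] for a, r in zip(actions, mean_rewards))


def variation_budget(mean_rewards: Sequence[Sequence[float]]) -> float:
    """V_T = sum_{t=2}^T ||r_t - r_{t-1}||_inf."""
    return sum(max(abs(c - p) for c, p in zip(cur, prev))
               for prev, cur in zip(mean_rewards, mean_rewards[1:]))
```

For example, an agent that keeps pulling arm 0 after the means flip from `[0.9, 0.1]` to `[0.1, 0.9]` accrues no dynamic regret before the flip and 0.8 per round afterwards, which is the kind of failure to track drift that the selection step penalizes.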
Loss & Training
The inner loop is standard SFT: minimizing cross-entropy on selected trajectory tokens without additional regularization. The number of outer iterations and hyperparameters \(k\), \(L\), and \(M\) are set per task. Theoretically, in a simplified scenario with a single-layer attention Transformer, the authors prove that the fixed point of this iterative "imitate lowest regret trajectories" process corresponds to the FTRL algorithm. Thus, no-regret behavior is induced by this paradigm rather than being a coincidence.
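
For reference, FTRL on the probability simplex can be sketched as below. The summary does not state which regularizer appears in the paper's fixed-point result, so the entropic regularizer used here (which gives the closed-form update \(\pi_{t+1} \propto \exp(\eta \sum_{s \le t} R_s)\), also known as Hedge) and the step size `eta` are assumptions for illustration; `ftrl_entropic` is a hypothetical name.

```python
import math
from typing import List, Sequence


def ftrl_entropic(rewards: Sequence[Sequence[float]], eta: float) -> List[List[float]]:
    """FTRL with a negative-entropy regularizer on the simplex.

    Returns the policy played at each round; the policy for round t only
    depends on reward vectors from rounds before t.
    """
    n_actions = len(rewards[0])
    cumulative = [0.0] * n_actions
    policies: List[List[float]] = []
    for r in rewards:
        # Softmax of eta * cumulative reward, shifted by the max for stability.
        m = max(cumulative)
        weights = [math.exp(eta * (c - m)) for c in cumulative]
        z = sum(weights)
        policies.append([w / z for w in weights])
        # Full-information feedback: the whole vector R_t is observed.
        cumulative = [c + x for c, x in zip(cumulative, r)]
    return policies
```

The first policy is uniform, and probability mass shifts toward actions with higher cumulative reward as rounds accumulate, which is the qualitative behavior the fixed-point result associates with the trained model.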
Key Experimental Results
Main Results
The experiments cover three types of models: (1) Small Transformers with numerical I/O as a controllable warm-up; (2) Open-source LLMs: Phi-3.5-mini, Gemma-2-9b-it, Qwen3-8B; (3) Closed-source LLMs: GPT-4o mini, trained via its SFT API.
| Environment | Model Type | Before Iterative RMFT | After Iterative RMFT |
|---|---|---|---|
| FOL (Verbalized) | Open-source LLMs (Phi-3.5 / Gemma-2-9b / Qwen3-8B) | Linear regret growth, \(\hat\beta \approx 1\) | \(\hat\beta < 1\), significant \(p_{\text{reg}}\), sublinear regret emerges |
| MAB (Verbalized) | GPT-4o mini | High SuffFailFreq, reluctant to explore | Significant drop in SuffFailFreq, increase in MinFrac, uniform exploration |
| NS-MAB (Verbalized) | Open-source LLMs + GPT-4o mini | Dynamic regret fails to track drift | Slower dynamic regret growth, able to switch arms after reward drift |
| FOL (Numerical Transformer) | Single/Multi-layer Attention | No no-regret guarantee at init | \(\hat\beta < 1\) after training, close to FTRL baseline |
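
The summary does not define \(\hat\beta\). One plausible reading, assumed here rather than taken from the paper, is the exponent of a power-law fit \(\text{Regret}(t) \approx c\, t^{\beta}\), estimated by least squares on a log-log scale, so that \(\hat\beta \approx 1\) indicates linear growth and \(\hat\beta < 1\) indicates sublinear growth. The sketch below implements only that fit; `fit_regret_exponent` is a hypothetical name, and the significance test behind \(p_{\text{reg}}\) is not covered.

```python
import math
from typing import Sequence


def fit_regret_exponent(cum_regret: Sequence[float]) -> float:
    """Slope of log(cumulative regret) against log(t) via ordinary least squares.

    cum_regret[t-1] is the cumulative regret after round t. Rounds with
    non-positive regret are skipped because their logarithm is undefined.
    """
    pts = [(math.log(t), math.log(r))
           for t, r in enumerate(cum_regret, start=1) if r > 0]
    if len(pts) < 2:
        raise ValueError("need at least two rounds with positive regret")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

A curve such as `0.5 * t` yields a slope close to 1, while `sqrt(t)` yields a slope close to 0.5.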
Ablation Study
| Configuration | Key Indicator | Description |
|---|---|---|
| Iterative RMFT (Full) | Sublinear regret growth; Exploration-exploitation balance | Complete method |
| RMFT (1 Round, Non-iterative) | Regret improves but remains near-linear | Single SFT round is insufficient to "amplify" low-regret behavior; iteration is key |
| Filtering by Reward (not Regret) | Regret rebounds in FOL/NS-MAB | Verifies cumulative reward maximization \(\neq\) regret minimization, especially in adversarial/non-stationary settings |
| No Self-Generated CoT (SFT actions only) | Generalization across scenarios drops | Self-generated reasoning is key for the model to maintain no-regret in new scenarios |
| Cross-task Generalization (train on FOL, test on MAB; vary horizon, number of actions, reward distribution) | Maintains sublinear regret | Shows the learned strategy is a general decision policy, not a pattern tied to a specific horizon |
Key Findings
- Regret is a better post-training signal than reward: Reward might suffice in stochastic environments, but in adversarial or non-stationary settings, cumulative reward maximization is not equivalent to regret minimization, causing model degradation.
- Self-generated CoT is the key source of generalization: Forcing the removal of CoT and only performing SFT on action tokens causes the model to fail when scenarios change (e.g., different reward descriptions or domains). Retaining self-generated reasoning allows for cross-task transfer.
- Iteration is mandatory: A single RMFT round only slightly lowers regret; multiple iterations gradually amplify sparse low-regret behavior patterns into the model's default behavior.
- Theoretical evidence: In a simplified single-layer attention Transformer setting, the fixed point of "imitating lowest regret trajectories" is FTRL, suggesting that no-regret behavior is a natural attractor for this paradigm.
Highlights & Insights
- Using regret as an "ex-post judge" instead of a "training loss" bypasses the difficulty of directly backpropagating regret through token-level autoregressive generation. This is a clever way to carry classic online learning tools over to LLM training.
- Using SFT instead of RL makes the method natively compatible with closed-source fine-tuning APIs (e.g., GPT-4o mini), significantly lowering the barrier for engineering adoption—something most RLHF/RLFT works cannot do.
- The theoretical result that "self-imitation converges to FTRL" gives a concrete asymptotic explanation for why no-regret behavior emerges from this self-distillation, beyond mere empirical observation.
- Transferable logic: Any task where a single scalar metric can ex-post evaluate a complete trajectory (multi-turn tool use, code agents, game playing) can adopt the "rollout → rank by metric → top-\(k\) self-SFT → iterate" paradigm by replacing regret with task-specific metrics.
Limitations & Future Work
- The authors acknowledge that training APIs for closed-source models like GPT-4o mini do not allow full control over hyperparameters, and training costs scale linearly with the number of scenarios and iterations.
- Theoretical results are limited to simplified single-layer attention settings; whether self-imitation still converges to no-regret algorithms on multi-layer Transformers remains an open question.
- Evaluation still focuses on canonical online DM tasks; truly complex linguistic decisions (e.g., multi-step tool use + long context) are only covered by variant scenarios rather than end-to-end agent benchmarks.
- Ranking by regret requires the ability to calculate regret ex-post, which necessitates either an oracle optimal strategy or full reward feedback. For many real-world scenarios (e.g., RLHF-style human preference feedback), the problem of "how to define regret" must be solved first.
- Future improvements: Replacing regret estimation with "relative gap estimation from the optimal strategy" provided by an LLM-as-judge could extend this paradigm to real-world tasks without an oracle. Alternatively, replacing SFT with DPO for preference optimization on "low regret vs. high regret trajectories" could improve sample efficiency.
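
The DPO variant suggested in the last item would need preference pairs instead of a top-\(k\) SFT set. A minimal sketch of how such pairs could be built from the same regret-scored rollouts is given below. It is speculative, like the suggestion itself: `build_preference_pairs` and `margin` are hypothetical, and the `prompt` / `chosen` / `rejected` keys follow a common preference-data layout rather than anything specified in the paper.

```python
from typing import Dict, List, Tuple


def build_preference_pairs(
    rollouts: Dict[str, List[Tuple[str, float]]],
    margin: float = 0.0,
) -> List[Dict[str, str]]:
    """Pair the lowest-regret and highest-regret trajectory of each scenario.

    rollouts maps a scenario prompt to (trajectory_text, regret) tuples.
    Scenarios whose regret gap does not exceed `margin` are skipped, since
    they carry little preference signal.
    """
    pairs: List[Dict[str, str]] = []
    for prompt, trajs in rollouts.items():
        if len(trajs) < 2:
            continue
        ranked = sorted(trajs, key=lambda item: item[1])
        best_text, best_regret = ranked[0]
        worst_text, worst_regret = ranked[-1]
        if worst_regret - best_regret > margin:
            pairs.append({"prompt": prompt,
                          "chosen": best_text,
                          "rejected": worst_text})
    return pairs
```

Unlike top-\(k\) filtering, this keeps the high-regret trajectories in play as negative examples, which is where the hoped-for gain in sample efficiency would come from.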
Related Work & Insights
- vs Nie et al. 2025 (Algorithm Distillation): They distill action sequences of known optimal algorithms (e.g., UCB); this work does not rely on known optimal algorithms and instead filters self-generated trajectories by regret, which generalizes better.
- vs Schmied et al. 2026 (RLFT): Both use self-generated CoT, but RLFT uses rewards as signals and relies on UCB-style manual CoT templates. This work uses regret signals, does not constrain the CoT format, and covers adversarial and non-stationary environments.
- vs Park et al. 2025b (Regret-loss): They backpropagate regret directly as a loss in numerical Transformers to obtain FTRL. This work extends the idea from "explicitly optimizing regret" to "using regret to select trajectories for SFT," making it applicable to verbalized I/O and closed-source LLMs.
- vs General RLHF / RLAIF: RLHF targets reward maximization rather than regret minimization and usually does not explicitly incentivize exploration. This work demonstrates that for decision-making tasks, regret is a more suitable training signal than reward.