🎮 Reinforcement Learning
🧪 ICML2026 · 20 paper notes
📌 Same area in other venues: 💬 ACL2026 (22) · 📷 CVPR2026 (19) · 🔬 ICLR2026 (138) · 🤖 AAAI2026 (70) · 🧠 NeurIPS2025 (168) · 📹 ICCV2025 (7)
🔥 Top topics: Reinforcement Learning ×9 · Reasoning ×4 · Agents ×2 · Adversarial Robustness ×2
- CAMEL: Confidence-Gated Reflection for Reward Modeling
-
This paper observes that the log-probability margin of the verdict token is highly correlated with judgment accuracy. Based on this, CAMEL is proposed: it first makes a quick preference judgment using a single token, only triggering reflective generation when confidence is low. Counterfactual prefix augmentation is used to enhance GRPO training for self-correction. On three reward model benchmarks, a 14B parameter model achieves an average accuracy of 82.9% (surpassing the previous best 70B model by 3.2%).
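The confidence gate described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the `reflect` callable, and the threshold value of 1.0 are all hypothetical, and the two log-probabilities are assumed to be those of the "A" and "B" verdict tokens.

```python
def verdict_margin(logp_a: float, logp_b: float) -> float:
    """Absolute log-probability margin between the two verdict tokens."""
    return abs(logp_a - logp_b)


def gated_judgment(logp_a, logp_b, reflect, threshold=1.0):
    """Return (verdict, used_reflection).

    `reflect` is a hypothetical callable standing in for the slower
    reflective generation; it receives the quick verdict and returns
    a final verdict. `threshold` is an assumed tuning parameter.
    """
    quick = "A" if logp_a >= logp_b else "B"
    if verdict_margin(logp_a, logp_b) >= threshold:
        # High confidence: keep the single-token judgment.
        return quick, False
    # Low confidence: fall back to reflective generation.
    return reflect(quick), True
```

Under these assumptions, only inputs with a small margin pay the cost of generating a reflection; everything else is decided from one token.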
- CPMöbius: Iterative Coach–Player Reasoning for Data-Free Reinforcement Learning
-
CPMöbius transforms self-play from "adversarial" to "collaborative": the Coach generates problems, the Player solves them, and the Coach receives a reward equal to "Player improvement × Player solve rate." Without any external training data, Qwen2.5-Math-7B-Instruct improves by +4.9 on average and +5.4 on OOD tasks across six math benchmarks, surpassing existing unsupervised methods such as RENT and R-Zero.
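The Coach reward quoted above is a simple product, which a short sketch makes concrete. How the paper measures "improvement" and "solve rate" is not stated here, so the accuracy-difference definition and all names below are assumptions for illustration only.

```python
def coach_reward(acc_before: float, acc_after: float, solve_rate: float) -> float:
    """Coach reward = Player improvement x Player solve rate.

    Assumption: improvement is the change in Player accuracy on some
    held-out probe set; solve_rate is the fraction of the Coach's
    generated problems that the Player solved.
    """
    improvement = acc_after - acc_before
    return improvement * solve_rate
```

Because the two factors are multiplied, the Coach earns nothing for problems the Player cannot solve at all (solve rate 0) or for problems that teach the Player nothing (improvement 0).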
- DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control
-
DR.Q builds upon the MR.Q "model-based representation + actor-critic" framework with two key additions: (1) it explicitly maximizes the mutual information between \(z_{sa}\) and the next-state representation \(z_{s'}\) using InfoNCE; (2) it introduces "faded prioritized replay," a fusion of "PER × forget," to mitigate overfitting to early experiences. With a single hyperparameter set, DR.Q outperforms strong baselines such as SimbaV2, MR.Q, and TD-MPC2 across 73 continuous control tasks.
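For readers unfamiliar with InfoNCE, the following is a generic sketch of the loss between a batch of \(z_{sa}\) vectors and their matching \(z_{s'}\) vectors. It is not DR.Q's code: the dot-product similarity, the temperature of 0.1, and the use of in-batch negatives are assumptions. Minimizing this loss maximizes a lower bound on the mutual information between the two representations.

```python
import math


def info_nce(z_sa, z_next, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    z_sa[i] and z_next[i] form the positive pair; every z_next[j]
    with j != i serves as an in-batch negative (an assumption).
    """
    n = len(z_sa)
    total = 0.0
    for i in range(n):
        # Similarity of anchor i to every candidate next-state embedding.
        logits = [
            sum(a * b for a, b in zip(z_sa[i], z_next[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the positive pair as the correct class.
        total += log_denom - logits[i]
    return total / n
```

The loss is small when each \(z_{sa}\) is most similar to its own \(z_{s'}\) and large when it is more similar to another sample's next state.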
- EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding
-
EARL employs a "coarse analysis–fine response" two-stage MLLM framework to unify egocentric interaction understanding tasks (description + QA + pixel mask) into a single pipeline: the first stage outputs a global description of the entire image and uses the last hidden state as a semantic prior, which is then injected into the second stage via a novel Analysis-guided Feature Synthesizer. Joint training is performed using GRPO and three types of rewards (format/answer/grounding accuracy). On Ego-IRGBench, EARL surpasses Seg-Zero in cIoU by 8.37%.
- From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents
-
To address two major bottlenecks in post-training "multi-turn interactive tool-using agents"—the high cost of quality data and RL signal corruption from user simulation noise—the authors propose "self-evolving multi-agent data synthesis (AReaL-SEA)" paired with executable verifiers as rewards. Combined with an RL recipe of "first SFT the user model, then large batch + dynamic filtering GRPO," this approach pushes Qwen3-235B to Airline 73.0 / Telecom 98.3 pass^1 on τ²-bench, matching or surpassing Claude/Gemini/GPT-5 across the board.
- How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess
-
The authors use "training LLMs to play chess" as a clean, verifiable RL testbed, systematically comparing the impact of six custom SFT datasets on RL. They find that "directly predicting the best move" achieves the highest scores but leads to unfaithful reasoning after RL, while "predicting the best line" yields comparable performance but more stable and faithful reasoning post-RL. Three metrics are distilled for predicting RL end performance from SFT checkpoints. Ultimately, a 7B model surpasses gpt-oss-120b on multiple chess benchmarks.
- Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism
-
This work challenges the mainstream consensus that "offline RL must be explicitly conservative," and proposes Neubay: adopting a Bayesian perspective on the posterior model ensemble, using long-horizon rollouts (hundreds of steps) to naturally absorb value overestimation, and controlling compounding error via layer norm and uncertainty thresholds. As a result, Neubay matches SOTA conservative algorithms on 33 D4RL/NeoRL datasets without pessimistic penalties, and sets new records on 7 datasets.
- Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication
-
SeqComm-DFL treats "multi-agent communication" as a predictor and "joint policy selection" as a downstream optimizer. By employing value-aware message generation, Stackelberg sequential conditioning, and implicit differentiation for bilevel optimization, it directly aligns communication learning with team return. This approach achieves a 4-6× improvement in cumulative reward and an increase of over 13 percentage points in win rate on hospital scheduling and SMAC benchmarks.
- Path-Coupled Bellman Flows for Distributional Reinforcement Learning
-
This paper explicitly weaves the affine transport geometry of the distributional Bellman equation into the flow matching path: it uses a shared base noise to simultaneously drive the paths of the current and successor states, and leverages a \(\lambda\) control variate to trade off bias and variance, resulting in a distributional critic that is source-consistent, Bellman endpoint-consistent, and stable.
- Probing RLVR Training Instability through the Lens of Objective-Level Hacking
-
The authors propose the "objective-level hacking" framework, attributing the increasing training-inference discrepancy in MoE large models under RLVR to biased pseudo-signals introduced by token-level weight distortion in the optimization objective. Through four sets of experiments on a 30B MoE, they verify that bias (not variance) is the root cause.
- QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
-
QHyer replaces the trajectory-dependent RTG in Decision Transformer with state-dependent Q-values estimated by Normalizing Flows, and stacks a gated Hybrid Attention-Mamba backbone to achieve content-adaptive historical compression. It sets new SOTA on both non-Markovian and Markovian offline goal-conditioned RL datasets (OGBench/D4RL).
- R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
-
R2R2 incorporates VICReg-style redundancy reduction constraints into self-predictive learning (SPL) to stabilize high-UTD training. The key modification is the omission of zero-centering: the authors prove that zero-centering removes the constant eigenmode (i.e., global dynamics information) in the spectral decomposition of SPL. Experiments show that on TD7 with UTD=20, the score increases from 1.02 to 1.24 (+22%), and the newly proposed SimbaV2-SPL architecture achieves new SOTA in continuous control.
- Recovering Hidden Reward in Diffusion-Based Policies
-
EnergyFlow explicitly parameterizes the score field of diffusion policy as the negative gradient of a scalar energy function, and proves that under maximum-entropy optimality, the score equals the gradient of the soft Q-function. This provides a scalar signal usable as a downstream RL shaping reward "for free" without adversarial optimization, while the conservative field constraint improves OOD generalization.
- SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning
-
This paper formalizes the plasticity loss of MoE policies in continual reinforcement learning as a decline in the spectral-entropy effective rank of the empirical NTK matrix. Using Gauss-Newton and Kronecker decompositions, it reduces this quantity to a computable proxy that depends only on the "expert feature Gram matrix," and finally uses a one-line Parseval penalty (SPHERE) to increase this proxy. On MetaWorld and HumanoidBench continual RL settings, task success rates are improved by 133% and 50%, respectively.
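A Parseval-style penalty is commonly written as the squared Frobenius distance between a Gram matrix and the identity. The sketch below shows that generic form; whether SPHERE applies it to features or weights, and with what normalization, is not stated in this summary, so the choice of input and the function names are assumptions.

```python
def gram(features):
    """Gram matrix of a list of feature vectors (rows)."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in features]
        for u in features
    ]


def parseval_penalty(features):
    """Squared Frobenius norm of (G - I), where G is the Gram matrix.

    Assumption: rows are per-expert feature vectors. The penalty is zero
    exactly when the rows are orthonormal, and grows as rows become
    correlated (i.e., as the Gram matrix loses effective rank).
    """
    g = gram(features)
    n = len(g)
    return sum(
        (g[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(n)
        for j in range(n)
    )
```

For example, two orthonormal rows give a penalty of 0, while two identical unit rows give a penalty of 2, reflecting the collapsed rank of their Gram matrix.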
- Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
-
This paper proposes the Reach-Avoid Probability Certificate (RAPC), which uses a max-min-clamped Bellman contraction operator to lower-bound the value function by the reach-avoid probability. It introduces an adversarial \(\gamma^T\)-decay "compensation factor" for normalization, and employs symmetric gradient projection to jointly optimize the conflicting objectives of "cost" and "reach-avoid probability." On MuJoCo, it achieves both lower cumulative cost and higher reach success rate than RC-PPO / RESPO / CPPO.
- T\(^2\)PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
-
T\(^2\)PO attributes training collapse in multi-turn agentic RL to "hesitation"—overthinking at the token level and repeated ineffective actions at the turn level. It introduces a self-calibrated uncertainty signal \(M_t\) (combining entropy and confidence) to simultaneously drive token-level Thinking Intervention (dynamic truncation of think segments) and turn-level Dynamical Sampling (resampling ineffective turns). This approach consistently outperforms PPO/GRPO/GiGPO on WebShop, ALFWorld, and Search QA, achieving stable training.
- Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning
-
This paper proposes FAN, which compresses "expensive generative policy + distributional critic" into "single-step flow anchoring + single noise-sample critic": it uses Flow Anchoring to complete behavior regularization within one flow evaluation, and replaces multi-sample quantile estimation with a single Gaussian noise sample in the noise-conditioned critic. FAN achieves SOTA performance on D4RL/OGBench while training 5-14× faster than comparable distributional methods.
- Trajectory-Level Data Augmentation for Offline Reinforcement Learning
-
This paper proposes LIFT: in active localization tasks, it leverages the geometric properties of trajectories to "shortcut" redundant zig-zag paths left by suboptimal logging policies, synthesizes these transitions, and feeds them to a lightweight augmentor that replaces logging actions during data collection. This enables offline CQL to significantly outperform standard offline RL and warm-start SAC across low- to high-dimensional, partially observable, and other settings.
- Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments
-
This work reinterprets the reasoning "drift" among multiple MLLMs as negative sample constraints in DPO, using Plackett-Luce preference loss to simultaneously suppress the divergent trajectories of N source models. As a result, a 7B student model, without ground-truth reports and using only 10% of MIMIC-CXR, surpasses all source teachers in chest X-ray classification and report generation tasks.
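The Plackett-Luce loss mentioned above is the negative log-likelihood of a ranking under a softmax-style choice model. The sketch below is the generic form only: how the paper scores trajectories (for instance, with DPO-style implicit rewards) and how it orders the N source models are not specified here, so the inputs are hypothetical.

```python
import math


def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under Plackett-Luce.

    `scores` must be listed in the desired rank order, best first.
    At each position k, the item at k must win a softmax over all
    items not yet ranked (positions k onward).
    """
    nll = 0.0
    # The last position contributes log(1) = 0, so it is skipped.
    for k in range(len(scores) - 1):
        tail = scores[k:]
        m = max(tail)
        log_z = m + math.log(sum(math.exp(s - m) for s in tail))
        nll += log_z - scores[k]
    return nll
```

With a single preferred response followed by one rejected response, this reduces to the familiar pairwise logistic loss; with N rejected trajectories it penalizes all of them in one objective.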
- Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning
-
This paper investigates the bi-level NP-hard problem of "identifying the K most vulnerable agents in a large-scale MARL system with N agents." The problem is formulated as HAD-MFC (Hierarchical Adversarial Decentralized Mean Field Control). The authors use the Fenchel-Rockafellar transform to fold the lower-level worst-case adversarial policy training into a regularized "robust mean-field Bellman operator," and convert the upper-level combinatorial selection into an MDP with dense rewards, solvable via greedy or RL methods. They prove that the decomposition preserves optimality, and the method outperforms baselines in 17 out of 18 tasks.