🎮 Reinforcement Learning

🧠 NeurIPS 2025 · 169 paper notes

A Differential and Pointwise Control Approach to Reinforcement Learning

This paper reformulates the RL problem via the differential dual form of continuous-time control, embeds physical priors through Hamiltonian structure, and proposes the dfPO algorithm for pointwise policy optimization. On scientific computing tasks (surface modeling, grid-based control, molecular dynamics), dfPO surpasses 12 RL baselines with fewer samples.

A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications

This paper extends the classical bisimulation metric (BSM), which is limited to measuring state similarity within a single MDP, to cross-MDP settings by proposing a Generalized Bisimulation Metric (GBSM). The authors rigorously prove three fundamental metric properties — symmetry, cross-MDP triangle inequality, and an upper bound on same-state distances — and derive tighter error bounds and closed-form sample complexities than standard BSM in three applications: policy transfer, state aggregation, and sampling-based estimation.

A Near-optimal, Scalable and Parallelizable Framework for Stochastic Bandits Robust to Adversarial Corruptions and Beyond

This paper proposes BARBAT, an improvement over the classical BARBAR algorithm. By fixing epoch lengths and adjusting failure probabilities per epoch, BARBAT reduces the regret of stochastic multi-armed bandits under adversarial corruptions from \(O(\sqrt{K}C)\) to the near-optimal \(O(C)\) (eliminating the \(\sqrt{K}\) factor), and successfully extends to multi-agent, graph bandit, combinatorial semi-bandit, and batched bandit settings.

A Theory of Multi-Agent Generative Flow Networks

This paper proposes a theoretical framework for Multi-Agent Generative Flow Networks (MA-GFlowNets) and establishes a "local-global principle" — the joint flow function can be decomposed into a product of individual agents' local flows. Four algorithms are designed (CFN/IFN/JFN/CJFN), among which JFN and CJFN realize Centralized Training with Decentralized Execution (CTDE). The proposed methods outperform RL and MCMC baselines on Hyper-Grid and StarCraft environments.

A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

This work is the first to introduce matrix splitting theory, unifying TD, FQI, and PFQI under linear function approximation as iterative methods for solving the same target linear system \((\Sigma_{cov} - \gamma\Sigma_{cr})\theta = \theta_{\phi,r}\), differing only in their preconditioners. It establishes necessary and sufficient conditions for the convergence of each algorithm, introduces the novel concept of rank invariance, and reveals that target networks are fundamentally a continuous transformation of the preconditioner from a constant to a data-adaptive form.
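
As a quick illustration of the matrix-splitting view (a sketch with stand-in random data, not the authors' code), TD and FQI can be written as the same preconditioned iteration on \((\Sigma_{cov} - \gamma\Sigma_{cr})\theta = \theta_{\phi,r}\), differing only in the preconditioner \(P\):

```python
import numpy as np

# Sketch with stand-in random data (not the paper's code): TD and FQI as
# preconditioned iterations  theta <- theta + P^{-1} (b - A theta)  on the
# shared system A theta = b, with A = Sigma_cov - gamma * Sigma_cr.
rng = np.random.default_rng(0)
d, n, gamma = 4, 5000, 0.9
Phi = rng.normal(size=(n, d))        # features phi(s_t)
Phi_next = rng.normal(size=(n, d))   # features phi(s_{t+1}) (stand-in data)
rewards = rng.normal(size=n)

Sigma_cov = Phi.T @ Phi / n                       # feature covariance
Sigma_cr = Phi.T @ Phi_next / n                   # cross-covariance
A, b = Sigma_cov - gamma * Sigma_cr, Phi.T @ rewards / n

def iterate(P, steps=1000):
    theta = np.zeros(d)
    for _ in range(steps):
        theta += np.linalg.solve(P, b - A @ theta)
    return theta

theta_td = iterate(np.eye(d) / 0.05)   # TD: constant preconditioner I / alpha
theta_fqi = iterate(Sigma_cov)         # FQI: data-adaptive preconditioner
print(np.allclose(theta_td, theta_fqi, atol=1e-5))  # both solve A theta = b
```

Per the paper's analysis, target networks (as in PFQI) correspond to preconditioners that move continuously between these two endpoints.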

Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

This paper proposes DP-AG (Action-Guided Diffusion Policy), which uses the Vector-Jacobian Product (VJP) of a diffusion policy's noise prediction as a structured stochastic force to drive dynamic evolution of latent observation features across diffusion steps, and closes the perception-action loop via a cycle-consistent contrastive loss. DP-AG improves success rates by 6% on Push-T, 13% on Dynamic Push-T, and more than 23% on a real UR5 robot.

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

This paper proposes Q3C (Q-learning for Continuous Control with Control-points), which approximates the Q-function via a learned set of control points such that the maximum value is structurally attained at one of those points. Combined with action-conditioned Q-value generation, a control-point diversity loss, and scale normalization, Q3C matches TD3 on standard benchmarks and substantially outperforms all actor-critic methods in constrained action spaces.

Adaptive Cooperative Transmission Design for URLLC via Deep RL

This paper proposes DRL-CoLA, a dual-agent DQN algorithm that adaptively configures 5G NR transmission parameters (numerology, mini-slot, MCS) at the source and relay nodes respectively. Operating over a two-hop relay system with only local CSI, DRL-CoLA achieves URLLC reliability close to the optimum attained under full global CSI.

Adaptive Neighborhood-Constrained Q Learning for Offline Reinforcement Learning

This paper proposes ANQ (Adaptive Neighborhood-constrained Q learning), which introduces advantage-function-based adaptive neighborhood constraints for offline RL. ANQ offers a flexible middle ground between density constraints (overly conservative) and support constraints (requiring precise behavior policy modeling), and realizes efficient Q learning via a bilevel optimization framework, achieving state-of-the-art performance on the D4RL benchmark.

Adaptively Coordinating with Novel Partners via Learned Latent Strategies

This paper proposes the TALENTS framework, which learns a latent strategy space via a VAE, discovers strategy types through K-Means clustering, and performs online teammate-type inference using the Fixed-Share regret minimization algorithm, enabling zero-shot real-time adaptive coordination with unknown human or agent teammates.
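
For context, the Fixed-Share update referenced above is a classic regret-minimization rule; a minimal sketch (the per-type losses and switching rate \(\alpha\) are illustrative assumptions, not the paper's interface):

```python
import numpy as np

# Minimal Fixed-Share sketch: an exponential-weights step followed by a
# "share" step that hedges against the teammate switching strategy types.
def fixed_share_update(weights, losses, eta=1.0, alpha=0.05):
    v = weights * np.exp(-eta * losses)   # losses[i]: e.g. NLL of the observed
    v /= v.sum()                          # teammate action under strategy type i
    return (1 - alpha) * v + alpha / len(v)
```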

ALINE: Joint Amortization for Bayesian Inference and Active Data Acquisition

ALINE proposes a unified framework for amortized Bayesian inference and active data acquisition. By combining a Transformer architecture with RL-based training, the model simultaneously learns to strategically select the most informative data points and perform instant posterior inference. It further supports flexible data acquisition targeting specific parameter subsets or predictive objectives.

Approximating Shapley Explanations in Reinforcement Learning

This paper proposes FastSVERL, a scalable parametric learning framework that separately approximates the two computational bottlenecks of Shapley values in reinforcement learning—the characteristic function and the Shapley summation—while supporting off-policy learning and continuous explanation updates as the policy evolves.

Automaton Constrained Q-Learning

This paper proposes ACQL (Automaton Constrained Q-Learning), which translates Linear Temporal Logic (LTL) task specifications into automata and combines goal-conditioned learning with minimal safety constraints. ACQL is the first scalable method to simultaneously support sequential temporal goals and non-stationary safety constraints in continuous control environments.

Bandit and Delayed Feedback in Online Structured Prediction

This paper is the first to study bandit and delayed feedback settings in online structured prediction. By designing a novel pseudo-inverse matrix gradient estimator, it achieves an \(O(T^{2/3})\) surrogate regret bound that does not explicitly depend on the output set size \(K\).

BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

BEAST parameterizes action sequences via B-splines—estimating control points through ridge regression and uniformly quantizing them into fixed-length tokens—achieving 20× token compression (100 steps → 5 tokens), mathematically guaranteed \(C^0\) continuity across action chunks, the highest success rate on LIBERO-Long (86.4%), and an inference throughput of 617 Hz (2.14× faster than π₀ and 101× faster than OpenVLA).
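
A rough sketch of this pipeline under stated assumptions (the clamped uniform knot layout, ridge coefficient, and quantization range are my choices; requires SciPy ≥ 1.10 for `BSpline.design_matrix`):

```python
import numpy as np
from scipy.interpolate import BSpline

def beast_tokenize(actions, n_ctrl=5, degree=3, lam=1e-3, levels=256, lo=-1.0, hi=1.0):
    """Sketch: (T, action_dim) action chunk -> (n_ctrl, action_dim) integer tokens."""
    T = len(actions)
    # Clamped uniform knot vector for n_ctrl control points of the given degree
    knots = np.r_[np.zeros(degree), np.linspace(0, 1, n_ctrl - degree + 1), np.ones(degree)]
    Phi = BSpline.design_matrix(np.linspace(0, 1, T), knots, degree).toarray()  # (T, n_ctrl)
    # Ridge regression for the control points, as described above
    ctrl = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_ctrl), Phi.T @ actions)
    # Uniform quantization of the control points into discrete tokens
    tokens = np.round((np.clip(ctrl, lo, hi) - lo) / (hi - lo) * (levels - 1))
    return tokens.astype(int)

# Example: a 100-step, 7-DoF action chunk compresses to 5 tokens per dimension
chunk = np.tanh(np.cumsum(np.random.default_rng(1).normal(size=(100, 7)), axis=0) * 0.05)
print(beast_tokenize(chunk).shape)   # (5, 7)
```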

Behavior Injection: Preparing Language Models for Reinforcement Learning

This paper identifies the root cause of inconsistent LLM responses to RL fine-tuning. Through per-step influence analysis, it reveals that RL effectiveness depends on (1) the distribution of rollout accuracy (moderate is optimal) and (2) data co-influence magnitude. The proposed BRIDGE method injects exploration/exploitation behaviors during SFT, boosting subsequent RL gains from 6% to 46.6%.

Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

This paper proposes Hybrid Quadratic-Linear Transformers (HQLT), which integrate KV-memory (softmax attention: precise retrieval but quadratic complexity) and FW-memory (DeltaNet/linear attention: linear complexity but coarse retrieval) as complementary memory systems. Three hybrid strategies are compared (Delayed-Streaming, Delayed-Chunk, and Synchronous), and the Synchronous variant is shown to be optimal across language modeling, retrieval, algorithmic reasoning, and RL tasks at 340M and 1.3B parameter scales.

Bootstrap Off-policy with World Model (BOOM)

This paper proposes the BOOM framework, which tightly couples an online planner (MPPI) with off-policy policy learning via a bootstrap loop: the policy initializes the planner, and the planner's high-quality actions are distilled back into the policy through a likelihood-free forward-KL alignment loss. A soft Q-weighting mechanism prioritizes high-return behaviors and mitigates divergence between planner and policy, yielding state-of-the-art performance on high-dimensional continuous control tasks.

Boundary-to-Region Supervision for Offline Safe Reinforcement Learning

This paper proposes B2R (Boundary-to-Region), a framework that addresses the symmetric conditioning fallacy of sequence models in offline safe RL by introducing Cost-to-Go (CTG) Realignment. It converts sparse boundary supervision into dense safe-region supervision, satisfying safety constraints on 35 out of 38 safety-critical tasks.

Certifying Concavity and Monotonicity in Games via Sum-of-Squares Hierarchies

This paper proves that verifying concavity and monotonicity in games with polynomial utilities and semi-algebraic strategy sets is NP-hard, and proposes two hierarchical certification schemes based on sum-of-squares (SOS) programming, each solvable in polynomial time at every level of the hierarchy.

Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions

This paper proposes a Generalized Lyapunov Function framework that combines RL value functions with neural network residual terms, replacing the classical strict per-step descent requirement with a multi-step weighted descent condition to certify the stability of RL policies.

Checklists Are Better Than Reward Models For Aligning Language Models

This paper proposes Reinforcement Learning from Checklist Feedback (RLCF), which decomposes instructions into dynamically generated yes/no checklists, scores each item using an AI judge and code verifier, and trains with DPO. RLCF consistently improves Qwen2.5-7B-Instruct across 5 benchmarks and is the only method that achieves positive gains on all benchmarks (FollowBench +4pt, InFoBench +6pt, Arena-Hard +3pt).
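
A toy sketch of the scoring step; `judge` and `verify` are hypothetical stand-ins for the AI judge and code verifier, and the checklist item format is assumed:

```python
# Toy sketch of checklist-based scoring; `judge` and `verify` are hypothetical
# stand-ins for the AI judge and the code verifier described above.
def checklist_score(response, checklist, judge, verify):
    item_scores = []
    for item in checklist:                     # item: {"question": ..., "code": ...}
        if item.get("code"):                   # programmatically verifiable constraint
            item_scores.append(1.0 if verify(item["code"], response) else 0.0)
        else:                                  # yes/no question posed to the AI judge
            item_scores.append(1.0 if judge(item["question"], response) else 0.0)
    return sum(item_scores) / len(item_scores) # used to rank responses into DPO pairs
```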

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

This paper proposes an "intention communication" architecture based on lightweight world models, enabling multi-agent coordination by generating and sharing future trajectory plans. The approach comprehensively outperforms end-to-end emergent communication methods in both scalability and performance.

Comparing Uniform Price and Discriminatory Multi-Unit Auctions through Regret Minimization

Under the online learning and regret minimization framework, this paper systematically compares the learning difficulty of uniform-price auctions and discriminatory auctions, proving that the two formats share identical worst-case regret rates, while under specific structural conditions the uniform-price auction admits faster learning rates.

Complexity Scaling Laws for Neural Models using Combinatorial Optimization

Using the Traveling Salesman Problem (TSP) as a case study, this paper investigates predictable scaling relationships between problem complexity (solution space size, representation space dimensionality) and model performance under fixed model capacity, revealing systematic performance trends for RL and SFT in combinatorial optimization.

Computational Hardness of Reinforcement Learning with Partial \(q^\pi\)-Realizability

This paper introduces the notion of "partial \(q^\pi\)-realizability" and proves that learning a near-optimal policy under this setting is NP-hard when using a greedy policy class, and requires exponential time under the rETH assumption when using a softmax policy class. These results bridge the theoretical gap between \(q^*\)-realizability and \(q^\pi\)-realizability.

Confounding Robust Deep Reinforcement Learning: A Causal Approach

This paper extends DQN via partial identification theory, proposing Causal DQN to learn robust policies from offline data with unobserved confounders—by optimizing a worst-case lower bound on the value function to obtain safe policies—and consistently outperforms standard DQN across 12 confounded Atari games.

Continual Knowledge Adaptation for Reinforcement Learning

This paper proposes CKA-RL, which maintains a task-specific knowledge vector for each task and employs softmax-weighted dynamic knowledge adaptation along with an adaptive knowledge merging mechanism, achieving a 4.20% overall performance gain and 8.02% forward transfer improvement across three continual RL benchmarks.
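
A loose sketch of what softmax-weighted knowledge adaptation could look like (all names, the similarity scores, and the temperature are assumptions, not the paper's interface):

```python
import numpy as np

# Loose sketch: softmax-weighted combination of stored task knowledge vectors;
# the similarity scores sims[i] and temperature tau are illustrative assumptions.
def adapt_knowledge(base_params, knowledge_vecs, sims, tau=1.0):
    logits = np.asarray(sims) / tau
    w = np.exp(logits - logits.max())          # numerically stable softmax
    w /= w.sum()
    return base_params + sum(wi * vi for wi, vi in zip(w, knowledge_vecs))
```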

Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning

This paper proposes the temperature decoupling gambit, proving that in entropy-regularized reinforcement learning, by decoupling the evaluation temperature from the behavioral temperature, both the policy and the return distribution converge—as the temperature tends to zero—to an interpretable, diversity-preserving optimal policy.

CORE: Constraint-Aware One-Step Reinforcement Learning for Simulation-Guided Neural Network Accelerator Design

This paper proposes CORE (Constraint-aware One-step REinforcement learning), a critic-free single-step RL framework that efficiently explores the joint hardware–mapping design space of DNN accelerators via structured distribution sampling, a scaling-graph decoder, and constraint-aware reward shaping, achieving at least 15× latency improvement across 7 DNN models.

Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

CoAct TD Learning challenges the random exploration paradigm of ε-greedy by selecting, with probability ε, the action that minimizes \(Q(s,a)\) (rather than a random action) to obtain high temporal-difference signals. The paper theoretically proves that this produces larger TD errors, achieves a 248% performance improvement on Atari 100K, and requires only a 2-line code change with zero additional computation.
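
The described change really is small; a sketch of the action selector (my paraphrase of the rule, not the authors' code):

```python
import torch

# Sketch of the counteractive rule: with probability epsilon, take the action
# that *minimizes* Q(s, a) to harvest large TD errors, instead of a random one.
def select_action(q_net, state, epsilon):
    q_values = q_net(state)                 # shape: (num_actions,)
    if torch.rand(()).item() < epsilon:
        return int(q_values.argmin())       # was: a uniformly random action
    return int(q_values.argmax())
```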

DCcluster-Opt: Benchmarking Dynamic Multi-Objective Optimization for Geo-Distributed Data Center Workloads

This paper proposes DCcluster-Opt, an open-source high-fidelity simulation benchmark platform for geo-distributed data centers. It integrates real-world datasets (carbon intensity, electricity prices, weather, etc.) and physics-based models to support reinforcement learning research on dynamic multi-objective workload scheduling.

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation

SambaY proposes the Gated Memory Unit (GMU) for sharing SSM token-mixing representations across layers, replacing half of the cross-attention layers in YOCO's cross-decoder with lightweight GMUs. This maintains linear prefill complexity and long-context retrieval capability while substantially improving decoding efficiency. The resulting model, Phi4-mini-Flash-Reasoning (3.8B), outperforms Phi4-mini-Reasoning on reasoning benchmarks and achieves up to 10× decoding throughput improvement in the 2K prompt + 32K generation setting.

Deep RL Needs Deep Behavior Analysis: Exploring Implicit Planning by Model-Free Agents

This paper proposes ForageWorld, a naturalistic foraging environment, and a neuroscience-inspired joint behavior-neural analysis framework, revealing that model-free RNN-based DRL agents exhibit structured, planning-like behavior through emergent dynamics—without explicit memory modules or world models.

DeepDiver: Adaptive Search Intensity Scaling via Open-Web Reinforcement Learning

DeepDiver is an RL-driven search-reasoning framework that trains LLMs for information-seeking in real open-web environments, giving rise to an emergent behavior termed Search Intensity Scaling (SIS)—enabling a 7B model to match DeepSeek-R1 (671B) on knowledge-intensive tasks.

DISCOVER: Automated Curricula for Sparse-Reward Reinforcement Learning

This paper proposes DISCOVER, a goal selection strategy for sparse-reward long-horizon RL that simultaneously balances achievability, novelty, and relevance to construct curricula directed toward a target task. The authors theoretically prove that the number of steps to reach the goal scales linearly with goal distance rather than with the volume of the search space, and demonstrate significant improvements over prior state-of-the-art exploration strategies on high-dimensional navigation and manipulation tasks.

Distribution Learning Meets Graph Structure Sampling

This paper establishes a novel connection between PAC learning of high-dimensional probabilistic graphical models and efficient counting/sampling of graph structures. By reducing the maintenance of an exponentially large expert pool to a weighted DAG sampling problem via online learning frameworks (EWA/RWM), the paper presents the first efficient agnostic learning algorithm for Bayesian networks with chordal graph skeletons, and improves the sample complexity for tree-structured distributions from \(O(nk^3/\varepsilon)\) to the optimal \(O(nk^2/\varepsilon)\).

Dynamic Regret Reduces to Kernelized Static Regret

This paper reformulates dynamic regret minimization as a static regret problem in a reproducing kernel Hilbert space (RKHS), achieving the optimal path-length-dependent bound \(\widetilde{\mathcal{O}}(\sqrt{M P_T T})\) via carefully designed shift-invariant kernels, without requiring prior knowledge of the time horizon.

Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization

DALI, a self-supervised context encoder, is introduced into the DreamerV3 architecture to infer latent environment parameters (e.g., gravity, friction) from interaction history. It achieves zero-shot generalization on cMDP benchmarks without retraining, outperforming ground-truth context-aware baselines by up to 96.4% on extrapolation tasks.

EgoBridge: Domain Adaptation for Generalizable Imitation from Egocentric Human Data

This paper proposes EgoBridge, a framework that uses Optimal Transport (OT) to align the joint distribution (features + actions) of human and robot data in a shared policy latent space, combined with Dynamic Time Warping (DTW) to construct pseudo-pairs, enabling cross-embodiment knowledge transfer from egocentric human data to robots, achieving up to 44% absolute improvement in success rate on real-world tasks.

Emergent World Beliefs: Exploring Transformers in Stochastic Games

This work extends the study of emergent world models in LLMs from perfect-information games (Othello, Chess) to the partial-information setting (Texas Hold'em Poker). By pre-training GPT-2 on PHH-format poker data and probing its internal activations, the paper demonstrates that the model not only learns deterministic features (hand rank recognition at ~98% accuracy) but also spontaneously develops internal representations of stochastic features (win rate/equity, correlation coefficient 0.59).

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Through 82,620 large-scale experiments, this work systematically investigates robustness and resilience in cooperative multi-agent RL, demonstrating that hyperparameter tuning matters more than algorithm selection, and revealing that commonly adopted practices such as parameter sharing, GAE, and PopArt are harmful under uncertainty. A set of practical hyperparameter recommendations is proposed.

Enhancing Interpretability in Deep Reinforcement Learning through Semantic Clustering

This paper proposes a Semantic Clustering Module (SCM) that combines a Feature Dimensionality Reduction (FDR) network with an adapted online VQ-VAE clustering mechanism, seamlessly integrated into the DRL training pipeline. The approach addresses the instability of t-SNE visualization and demonstrates that DRL inherently exhibits dynamic, semantics-based clustering behavior.

Exploration with Foundation Models: Capabilities, Limitations, and Hybrid Approaches

This paper systematically evaluates the zero-shot exploration capabilities of LLMs/VLMs on classic RL exploration tasks (bandits, Gridworld, Atari), identifies a knowing-doing gap in VLMs — where high-level reasoning succeeds but low-level control fails — and proposes a simple VLM-RL hybrid framework that substantially accelerates learning under idealized conditions.

Extending NGU to Multi-Agent RL: A Preliminary Study

This paper extends the single-agent NGU (Never Give Up) algorithm to multi-agent settings and conducts a systematic ablation across three design dimensions: shared replay buffer, shared novelty signal, and heterogeneous β parameters. The results show that NGU combined with a shared experience replay buffer significantly outperforms a multi-agent DQN baseline on the PettingZoo simple_tag pursuit task.

FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models

This paper proposes FedRAIN-Lite, a federated reinforcement learning framework that assigns RL agents to individual latitude bands to learn local climate parameterization policies with periodic global aggregation. Evaluated on a hierarchical idealized energy balance model (EBM), DDPG with this framework reduces area-weighted RMSE by over 50% in tropical and mid-latitude regions, providing a viable pathway for scaling RL to full-scale GCMs.

Feel-Good Thompson Sampling for Contextual Bandits: a Markov Chain Monte Carlo Showdown

This paper presents the first systematic empirical evaluation of Feel-Good Thompson Sampling (FG-TS) and its smoothed variant SFG-TS under approximate posteriors, spanning linear, logistic, and neural contextual bandit settings across fourteen benchmarks. The study finds that FG-TS outperforms standard TS when exact posteriors are available (linear/logistic), but degrades in neural bandits, revealing a critical trade-off between optimistic bias and sampling noise.

Financial Instruction Following Evaluation (FIFE)

FIFE is a challenging instruction-following benchmark for financial analysis tasks, comprising 88 manually authored complex prompts and 40+ chainable, domain-specific verifiable constraints. It evaluates 53 models under both strict and loose modes, revealing that even the strongest open-weight model (76.1% strict) fails to perfectly follow complex financial instruction requirements.

Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning

This work provides the first finite-sample complexity analysis for policy evaluation in robust average reward MDPs. By constructing a carefully designed semi-norm, it proves that the robust Bellman operator is a contraction, and combines this with a truncated Multi-Level Monte Carlo (MLMC) estimator to achieve finite expected sample complexity, ultimately attaining an order-optimal sample complexity of \(\tilde{\mathcal{O}}(\epsilon^{-2})\).

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

This paper proposes Forl, a framework that fuses multimodal candidate states generated by a conditional diffusion model with shift predictions from a zero-shot time-series foundation model via Dimension-wise Closest Matching (DCM). Forl enables deployment-time adaptation to non-stationary observation functions that shift episodically, without retraining, achieving substantial average score improvements on D4RL benchmarks.

Foundation Models as World Models: A Foundational Study in Text-Based GridWorlds

This paper systematically evaluates foundation models (LLMs) as zero-shot world models (FWM) and direct decision-making agents (FA) in text-based gridworlds, revealing complementary advantages of the two strategies in deterministic and stochastic environments.

Gaussian Process Upper Confidence Bound Achieves Nearly-Optimal Regret in Noise-Free Gaussian Process Bandits

This paper proves that GP-UCB achieves nearly-optimal regret in the noise-free GP bandit problem, establishing for the first time \(O(1)\) constant cumulative regret under the SE kernel and \(O(1)\) cumulative regret under the Matérn kernel (when \(d < \nu\)), thereby closing a long-standing gap between the theory and practice of GP-UCB.

Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update

This paper proposes the GLB-OMD algorithm, which, for the first time in the generalized linear bandit (GLB) setting, simultaneously achieves a near-optimal regret bound of \(\mathcal{O}(\log T\sqrt{T/\kappa_*})\) and \(\mathcal{O}(1)\) per-round time and space complexity. The key technical contribution is constructing tight confidence sets for an online mirror descent (OMD) estimator via mix loss.

Generalizing Verifiable Instruction Following

This paper introduces IFBench, a benchmark for evaluating generalization in precise instruction following, demonstrating that current SOTA models severely overfit to the 25 constraint templates of IFEval. It further proposes IF-RLVR, a training method based on GRPO with verifiable rewards, which significantly improves both in-domain and out-of-domain instruction following performance.

Global Convergence for Average Reward Constrained MDPs with Primal-Dual Actor-Critic

This paper proposes the Primal-Dual Natural Actor-Critic (PDNAC) algorithm, which achieves, for the first time, a global convergence rate of \(\tilde{\mathcal{O}}(1/\sqrt{T})\) and a constraint violation rate of \(\tilde{\mathcal{O}}(1/\sqrt{T})\) for average reward constrained MDPs under general parameterized policies, matching the theoretical lower bound.

Gradient-Variation Online Adaptivity for Accelerated Optimization with Hölder Smoothness

This paper develops gradient-variation adaptive online learning algorithms for Hölder smooth function classes, achieving regret that smoothly interpolates between the smooth and non-smooth extremes. Via online-to-batch conversion, it provides the first universal method for strongly convex optimization that attains accelerated convergence in the smooth case and near-optimal convergence in the non-smooth case.

GraphChain: Large Language Models for Large-scale Graph Analysis via Tool Chaining

This paper proposes GraphChain, a framework that enables LLMs to analyze large-scale graphs in a progressive, human-like exploratory manner through two key components: progressive graph distillation (RL-driven tool-chain sequence generation) and structure-aware test-time adaptation (lightweight adapters conditioned on graph topology fingerprints). GraphChain achieves an average accuracy of 84.7%, surpassing the best baseline by 20.7%, and scales to graphs with up to 200,000 nodes.

Greedy Algorithm for Structured Bandits: A Sharp Characterization of Asymptotic Success / Failure

This paper provides a complete theoretical characterization of the greedy algorithm in structured bandit problems, proposing self-identifiability as a necessary and sufficient condition for the greedy algorithm to achieve sublinear regret, and extends the results to contextual bandits and the general interactive decision-making framework DMSO.

Horizon Reduction Makes RL Scalable

Through large-scale experiments involving up to one billion transitions, this paper identifies the curse of horizon—excessively long decision horizons—as the primary scalability bottleneck in offline RL, and demonstrates that horizon reduction techniques such as n-step returns and hierarchical policies substantially improve scalability. Building on this analysis, the paper proposes SHARSA, a simple yet effective method.

Human-Inspired Multi-Level Reinforcement Learning

This paper proposes RbRL-KL, which augments rating-based RL (RbRL) with a KL divergence-driven policy loss term. By leveraging failure experiences across different rating levels with varying weights to repel the current policy, RbRL-KL outperforms standard RbRL across 6 DeepMind Control environments.

Hybrid Latent Reasoning via Reinforcement Learning

HRPO proposes a hybrid latent reasoning policy optimization framework: a learnable gating mechanism progressively blends the hidden state representation from the previous step into the sampled token embeddings, enabling LLMs to leverage both discrete tokens and continuous latent representations during inference. Without requiring CoT annotations, HRPO is trained entirely via RL and outperforms baselines such as PPO and GRPO on both knowledge-intensive and STEM reasoning tasks.

Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality

This paper proposes new algorithms for the online Pandora's Box problem, improving regret from \(\widetilde{O}(n\sqrt{T})\) to \(\widetilde{O}(\sqrt{nT})\) (matching the lower bound), and introduces the first contextual linear extension achieving \(\widetilde{O}(nd\sqrt{T})\) regret.

Improved Regret Bounds for GP-UCB in Bayesian Optimization

This paper proves that GP-UCB achieves \(\widetilde{O}(\sqrt{T})\) high-probability regret under the Bayesian setting (when the Matérn kernel satisfies a smoothness condition) and \(O(\sqrt{T \ln^2 T})\) for the SE kernel, closing the gap between existing upper bounds for GP-UCB and the optimal upper bounds.

Improving Planning and MBRL with Temporally-Extended Actions

This paper proposes treating action duration as an additional optimization variable in shooting-based planning and MBRL, combined with a multi-armed bandit (MAB) mechanism for automatic duration range selection. The approach significantly accelerates planning across multiple environments and solves challenging tasks that standard methods fail to handle.

Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning

This work models multiple components of a complex RAG pipeline (Query Rewriter, Selector, Generator) as a cooperative multi-agent system and jointly optimizes them via MAPPO, using the F1 score of the final answer as a shared reward. The proposed method outperforms existing single-module optimization approaches on multiple QA benchmarks.

Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models

This paper proposes RAIF, which employs RL with rule-centric rewards to cultivate deep reasoning capabilities in LLMs for complex instructions containing And/Chain/Selection/Nested compositional constraints. A key finding is that vanilla CoT is detrimental to instruction following, as LLMs tend to shallowly paraphrase instructions rather than analyze constraint structures. RAIF addresses this through superior CoT enforcement (sample-level contrastive filtering of ineffective reasoning) and behavior cloning to control distribution shift. A 1.5B model trained with RAIF matches 8B-level performance, achieving an average improvement of 11.74% across 7 benchmarks.

Incremental Sequence Classification with Temporal Consistency

This paper imports the temporal-difference (TD) learning idea from reinforcement learning into sequence classification, proposing the TC-\(\lambda\) loss function. By requiring the predictive distributions at adjacent time steps to satisfy a temporal consistency condition, it trains incremental sequence classifiers that outperform standard cross-entropy methods on both text classification and LLM verification tasks.
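
A one-step sketch of the idea (the full TC-\(\lambda\) loss mixes multi-step targets with \(\lambda\)-weighting; shapes and the mixing weight here are assumptions):

```python
import torch
import torch.nn.functional as F

# One-step temporal-consistency sketch: each prefix's predictive distribution
# is pulled toward the (detached) prediction at the next step, TD-style.
def tc_loss(logits, final_label, alpha=0.5):
    # logits: (T, C), one class prediction per prefix of the sequence
    log_p = F.log_softmax(logits[:-1], dim=-1)
    target = F.softmax(logits[1:], dim=-1).detach()   # next-step prediction as target
    consistency = F.kl_div(log_p, target, reduction="batchmean")
    supervised = F.cross_entropy(logits[-1:], final_label.view(1))  # label at the end
    return supervised + alpha * consistency
```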

Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI Coordination

Inspired by Vygotsky's theory of inner speech, this paper proposes MIMIC, a framework that uses language as an intermediate representation between perception and action. A VLM provides language scaffolding to train a CVAE that generates inner speech, which then conditions a diffusion policy to produce diverse and steerable behaviors.

Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning

When annotation cost is measured per state rather than per trajectory, the interactive method Stagger is provably shown to surpass Behavior Cloning under the \(\mu\)-recoverability condition (suboptimality \(O(\mu H \log B / N)\) vs. \(O(RH \log B / CN)\), with significant advantage when \(\mu \ll R\)). The paper further proposes a hybrid IL algorithm, Warm-Stagger, which combines offline data with interactive annotation to achieve strictly complementary advantages from both data sources on specific MDPs.

Inverse Optimization Latent Variable Models for Learning Costs Applied to Route Problems

This paper proposes IO-LVM (Inverse Optimization Latent Variable Model), which employs a VAE-style encoder to map observed COP solutions into a latent cost space. A Fenchel-Young loss combined with black-box solvers (Dijkstra/TSP solver) ensures feasibility at the decoding stage. The model learns the distribution of cost functions from route data without agent labels, and successfully separates navigation preferences of different agents in an unsupervised manner.

Kimina Lean Server: A High-Performance Lean Server for Large-Scale Verification

This paper presents Kimina Lean Server — a high-performance Lean 4 verification server designed for large-scale reinforcement learning training. By leveraging server-side parallelization and an LRU caching mechanism, it achieves 1.5–2× speedups over existing tools and has been used to train the state-of-the-art theorem proving model Kimina-Prover.

Knowledge-based Visual Question Answer with Multimodal Processing, Retrieval and Filtering

This paper proposes Wiki-PRF, a three-stage (Processing–Retrieval–Filtering) multimodal RAG framework that trains a VLM via reinforcement learning to autonomously invoke visual tools and filter retrieved results, achieving state-of-the-art performance on E-VQA and InfoSeek.

Last Iterate Convergence in Monotone Mean Field Games

This paper proposes a KL-divergence-based proximal point (PP) method that achieves asymptotic last iterate convergence (LIC) in non-strictly monotone mean field games (MFGs), and proves that regularized mirror descent (RMD) converges to regularized equilibria at an exponential rate. The combined approximate proximal point (APP) algorithm reliably converges to non-regularized equilibria on standard benchmarks.

Learning from Demonstrations via Capability-Aware Goal Sampling

This paper proposes Cago, a method that dynamically tracks an agent's attainment capability along expert demonstration trajectories and adaptively samples intermediate goals near the capability frontier, constructing an implicit curriculum to guide learning in long-horizon, sparse-reward tasks.

Learning Human-Like RL Agents through Trajectory Optimization with Action Quantization

This paper proposes MAQ (Motion-Action Quantization), a method that discretizes human actions into a finite set of motion primitives via VQ-VAE, then performs trajectory optimization within the quantized action space to train RL agents whose behavioral patterns more closely resemble those of humans.

Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

This paper proposes AC-SMFG, the first single-loop Actor-Critic algorithm with non-asymptotic convergence guarantees for solving Stackelberg Mean Field Games (SMFGs), achieving a convergence rate of \(\widetilde{\mathcal{O}}(k^{-1/2})\).

Learning Interactive World Model for Object-Centric Reinforcement Learning

This paper proposes FIOC-WM, which learns the interaction structure among objects in a world model via a two-level factorization at the object and attribute levels. It trains a hierarchical policy grounded in interaction primitives, achieving more efficient policy learning and compositional generalization across multiple robot control tasks.

Learning Interestingness in Automated Mathematical Theory Formation

This paper proposes Fermat—a reinforcement learning environment that models mathematical theory formation as an MDP—and EvoAbstract, an LLM-driven evolutionary algorithm with abstraction learning, for automatically synthesizing interestingness metrics for mathematical objects. The approach substantially outperforms hard-coded baselines in elementary number theory and finite fields.

Learning Intractable Multimodal Policies with Reparameterization and Diversity Regularization

This paper proposes the Diversity-regularized Actor Critic (DrAC) algorithm, which unifies intractable multimodal policies (amortized actor and diffusion actor) under a stochastic-mapping formulation, enables direct policy gradient optimization via reparameterization without requiring probability density evaluation, and introduces a distance-based diversity regularization as an alternative to entropy regularization. DrAC demonstrates significant advantages on diversity-critical tasks such as multi-goal navigation and generative RL.

Learning Memory-Enhanced Improvement Heuristics for Flexible Job Shop Scheduling

This paper proposes MIStar—the first deep reinforcement learning (DRL)-based improvement heuristic framework for the Flexible Job Shop Scheduling Problem (FJSP). Key innovations include a directed heterogeneous disjunctive graph representation, a Memory-enhanced Heterogeneous Graph Neural Network (MHGNN), and a parallel greedy search strategy. MIStar consistently outperforms handcrafted improvement heuristics and state-of-the-art constructive DRL methods on both synthetic datasets and public benchmarks.

Learning to Clean: Reinforcement Learning for Noisy Label Correction

This paper formulates noisy label correction as a Markov Decision Process under the reinforcement learning framework, proposing RLNLC. A policy function built upon a k-nearest-neighbor embedding space determines which labels should be corrected, guided by a label consistency reward and a cross-subset alignment reward. RLNLC achieves state-of-the-art performance across multiple benchmark datasets under both instance-dependent and symmetric noise settings.

Learning to Focus: Prioritizing Informative Histories with Structured Attention Mechanisms in Partially Observable Reinforcement Learning

Two structured temporal priors—Memory-Length Prior and Gaussian Distributional Prior—are embedded into the self-attention mechanism of a Transformer world model. Under partially observable RL settings, Gaussian Attention achieves a 77% relative improvement in human-normalized score over UniZero on the Atari 100k benchmark with negligible computational overhead.

Massively Parallel Imitation Learning of Mouse Forelimb Musculoskeletal Reaching Dynamics

This work presents MIMIC-MJX, a massively parallel imitation learning pipeline for mouse forelimb musculoskeletal simulation. Leveraging JAX-accelerated PPO at 1.2 million steps/second across thousands of parallel environments, the pipeline trains physically-informed imitation learning policies. The study demonstrates that control cost regularization enables simulated muscle activity to better predict real EMG signals, and employs a Takens-theorem-based nonlinear dynamical systems approach to predict muscle activation from joint kinematics.

Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning

This paper proposes the SUBSAMPLE-MFQ algorithm, which randomly samples \(k\) agents from \(n\) to perform mean-field Q-learning, reducing the sample complexity of multi-agent reinforcement learning from \(\text{poly}(n)\) to \(\text{poly}(k)\). The resulting optimality gap is only \(\tilde{O}(1/\sqrt{k})\) (independent of \(n\)), achieving exponential speedup over standard mean-field MARL when \(k = O(\log n)\).
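
A sketch of the subsampling step for finite state spaces (names and interfaces are illustrative):

```python
import numpy as np

# Sketch: the learned Q-function sees an agent's own state plus the empirical
# state distribution of only k subsampled agents, so learning scales with
# poly(k) rather than poly(n).
def subsampled_mean_field(agent_states, k, num_states, rng):
    # agent_states: (n,) integer array of all agents' current states
    idx = rng.choice(len(agent_states), size=k, replace=False)
    return np.bincount(agent_states[idx], minlength=num_states) / k
```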

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

This paper proposes Memo, a Transformer-based memory-augmented framework that periodically generates summary tokens to compress historical context. Memo matches or exceeds the performance of full-context Transformers while reducing the KV cache at inference time by 8–10×, and demonstrates superior generalization to long contexts as well as robustness under streaming inference.

Meta-World+: An Improved, Standardized, RL Benchmark

This paper systematically exposes how undocumented reward function inconsistencies across versions of the Meta-World benchmark distort algorithm comparisons, and releases a standardized new version, Meta-World+, which explicitly retains both V1 and V2 reward functions, introduces MT25/ML25 task sets, upgrades to the Gymnasium API, and enables fully reproducible evaluation for multi-task and meta-reinforcement learning.

MetaBox-v2: A Unified Benchmark Platform for Meta-Black-Box Optimization

MetaBox-v2 is a milestone upgrade to the Meta-Black-Box Optimization (MetaBBO) benchmark platform. It provides unified support for four learning paradigms (RL/SL/NE/ICL), reproduces 23 baseline algorithms, integrates 18 test suites (1900+ problem instances), and achieves 10–40× speedup via vectorized environments and distributed evaluation.

Mind the GAP! The Challenges of Scale in Pixel-based Deep Reinforcement Learning

This paper identifies the "bottleneck connection" between the encoder (convolutional layers \(\phi\)) and the fully connected layers (\(\psi\)) as the fundamental obstacle to scaling pixel-based deep RL networks, and proposes Global Average Pooling (GAP) — a minimal architectural change — to directly resolve this bottleneck. GAP achieves performance on par with or superior to complex methods (SoftMoE, sparse training) at substantially lower computational cost.
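
The fix amounts to swapping one module; a sketch with Atari-style shapes (layer sizes assumed):

```python
import torch.nn as nn

# Sketch: GAP between the conv encoder (phi) and the fully connected head (psi)
# removes the huge flatten bottleneck; psi's input width is just the channel count.
encoder = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # GAP: (B, 64, H, W) -> (B, 64, 1, 1)
    nn.Flatten(),              # (B, 64) instead of (B, 64 * H * W)
)
head = nn.Linear(64, 18)       # psi now scales with channels, not spatial size
```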

Mixing Expert Knowledge: Bring Human Thoughts Back to the Game of Go

This paper proposes LoGos, which applies mixed-domain expert data (Go) and general long chain-of-thought (CoT) reasoning data for cold-start fine-tuning followed by GRPO reinforcement learning, enabling a general-purpose LLM to reach professional-level Go performance while preserving strong general reasoning capabilities.

Models That Prove Their Own Correctness

This paper proposes the Self-Proving Models framework, in which a model proves the correctness of its outputs to a verifier algorithm via an interactive proof system. Two learning algorithms are introduced—Transcript Learning (TL) and Reinforcement Learning from Verifier Feedback (RLVF)—and experiments on the GCD computation task demonstrate that Annotated TL achieves 96% Verifiability.

Modulation of Temporal Decision-Making in a Deep Reinforcement Learning Agent under the Dual-Task Paradigm

DRL agents trained in a simplified Overcooked environment to perform either a single task (temporal production) or a dual task (temporal production + numerical comparison) exhibit significantly greater temporal overproduction across all four target durations in the dual-task condition—an emergent behavior that closely parallels the time overestimation phenomenon observed in human temporal perception research under dual-task paradigms.

MTL-KD: Multi-Task Learning Via Knowledge Distillation for Generalizable Neural Vehicle Routing Solver

This paper proposes MTL-KD, a multi-task learning framework based on knowledge distillation. It distills policy knowledge from multiple RL single-task teacher models into a heavy-decoder student model, achieving efficient unified solving across diverse VRP variants with superior generalization on large-scale instances.

Multi-Agent Collaboration via Evolving Orchestration

This paper proposes a "Puppeteer" multi-agent collaboration paradigm in which a centralized orchestrator learns via RL to dynamically select which agent to activate at each reasoning step. The approach simultaneously improves performance and efficiency on both closed-domain and open-domain tasks, and reveals that evolved topologies tend toward more compact cyclic structures.

Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach

This work reformulates entropy-regularized max-min multi-objective reinforcement learning as a two-player zero-sum regularized game, proposes the ERAM/ARAM algorithms with closed-form weight updates via mirror descent, and establishes global last-iterate convergence, substantially outperforming baselines across multiple MORL benchmarks.
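
The weight player's closed-form mirror-descent step reduces to a multiplicative-weights update on the simplex; a sketch (step size and objective estimates are illustrative):

```python
import numpy as np

# Sketch: mirror descent on the simplex shifts weight toward the currently
# worst-performing objective, implementing the max-min criterion.
def update_weights(w, objective_values, eta=0.1):
    w_new = w * np.exp(-eta * np.asarray(objective_values))
    return w_new / w_new.sum()
```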

Near-Optimal Quantum Algorithms for Computing (Coarse) Correlated Equilibria of General-Sum Games

This work presents the first quantum algorithms for computing correlated equilibria (CE) and coarse correlated equilibria (CCE) in multi-player general-sum games. By quantizing the multi-scale MWU framework and introducing a unified QRAM scheme, the paper achieves a near-optimal query complexity of \(\tilde{O}(m\sqrt{n})\) in both the number of players \(m\) and actions \(n\), along with matching quantum lower bounds.

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

This paper proposes NoisyRollout, a data augmentation method with zero additional training cost. During GRPO-based VLM training, it mixes rollouts from clean and moderately perturbed images to enhance policy exploration diversity. Using only 2.1K samples, it achieves state-of-the-art performance among open-source RL fine-tuned models across five out-of-domain benchmarks.

Non-convex Entropic Mean-Field Optimization via Best Response Flow

This work extends Best Response Flow from convex functional optimization to the non-convex setting, proving that under sufficiently large entropic regularization the BR operator becomes a contraction in the \(L^1\)-Wasserstein distance, thereby guaranteeing the existence of a unique global minimizer and exponential convergence for non-convex objectives.

On the Global Optimality of Policy Gradient Methods in General Utility Reinforcement Learning

This paper establishes global optimality guarantees for policy gradient methods in reinforcement learning with general utilities (RLGU): in the tabular setting, global convergence is proved via a novel gradient dominance inequality; in large-scale state-action spaces, an occupancy measure approximation algorithm PG-OMA based on maximum likelihood estimation (MLE) is proposed, whose sample complexity depends only on the dimension \(m\) of the function approximation class rather than the size of the state-action space.

Online Optimization for Offline Safe Reinforcement Learning

This paper proposes O3SRL, a framework that formalizes offline safe reinforcement learning as a minimax optimization problem. By combining an offline RL oracle with EXP3-based online optimization for adaptive Lagrange multiplier adjustment, O3SRL avoids unstable off-policy evaluation and achieves high reward under strict safety constraints.

Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning

Open Vision Reasoner (OVR) employs a two-stage training paradigm—linguistic cold start followed by large-scale multimodal RL—to effectively transfer cognitive behaviors (e.g., backtracking, verification) from language models to visual reasoning. Built on Qwen2.5-VL-7B, OVR achieves 51.8% on MathVision, the first model at this scale to surpass 50%, establishing a new state of the art among same-scale models.

Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning

This paper proposes UEPO, a framework comprising three core components—multi-seed dynamics-aware diffusion policies, dynamic divergence regularization, and diffusion-based data augmentation—to address insufficient multimodal behavioral coverage and distribution shift in offline-to-online reinforcement learning, surpassing Uni-O4 on the D4RL benchmark.

Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning

This paper proposes REBMBO, a framework that unifies Gaussian Processes (local modeling), Energy-Based Models (EBM, global exploration), and PPO-based reinforcement learning (multi-step look-ahead) into a closed-loop Bayesian optimization system, achieving significant improvements over conventional BO methods on high-dimensional and multi-modal black-box optimization tasks.

Oryx: a Scalable Sequence Model for Many-Agent Coordination in Offline MARL

This paper proposes Oryx, a scalable sequence model algorithm for offline cooperative MARL that integrates the Retention-based Sable architecture with an autoregressive formulation of ICQ offline regularization. Through a dual-decoder that jointly outputs policies and Q-values, combined with counterfactual advantage estimation, Oryx achieves state-of-the-art performance on more than 80% of 65 datasets and demonstrates robust scalability to 50-agent scenarios.

Parameter-Free Algorithms for the Stochastically Extended Adversarial Model

This work presents the first parameter-free algorithms for the Stochastically Extended Adversarial (SEA) model, which bridges adversarial and stochastic online convex optimization. Without prior knowledge of the domain diameter \(D\) and/or the Lipschitz constant \(G\), the proposed algorithms—built upon Optimistic Online Newton Step (OONS)—achieve regret bounds comparable to those of parameter-aware methods.

Parameter Efficient Fine-tuning via Explained Variance Adaptation

This paper proposes Explained Variance Adaptation (EVA), which initializes LoRA matrices via incremental SVD on activation vectors from downstream data, provably maximizing the expected gradient signal. Combined with an adaptive rank allocation mechanism, EVA establishes a new accuracy–efficiency Pareto frontier across language generation/understanding, image classification, and reinforcement learning.

PARCO: Parallel AutoRegressive Models for Multi-Agent Combinatorial Optimization

PARCO is a framework that solves multi-agent combinatorial optimization problems efficiently via Communication Layers for inter-agent coordination, a Multiple Pointer Mechanism for parallel decoding, and a Priority-based Conflict Handler for conflict resolution.

Periodic Skill Discovery

This paper proposes Periodic Skill Discovery (PSD), a framework that maps states onto a circular latent space to naturally encode periodicity, enabling unsupervised discovery of diverse locomotion skills with varying periods.

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

This paper proposes the M-AUPO algorithm for preference-based reinforcement learning, leveraging the Plackett-Luce ranking model to handle multi-option comparison feedback, and provides the first theoretical proof that larger subset sizes directly improve sample efficiency.

Prompt Tuning Decision Transformers with Structured and Scalable Bandits

This paper proposes a structured prompt tuning method based on multi-armed bandits. By decomposing prompts into independent segments and leveraging a pretrained PDT as a feature extractor, the method reduces prompt search complexity from combinatorial explosion to linear scale, significantly improving inference performance of a frozen PDT backbone in multi-task offline RL.

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

This paper proposes AcTOL, which learns ordered and continuous vision-language representations via a visual-language ordering loss and a Brownian bridge constraint, without relying on rigid goal-reaching assumptions, achieving significant improvements on downstream simulated and real-world robot manipulation tasks.

Quantifying Generalisation in Imitation Learning

This paper proposes the Labyrinth benchmark environment, which achieves strict separation between training and evaluation data through controllable maze structure variations. It reveals severe deficiencies in the structural generalisation of current imitation learning methods (best method achieves only 5% success rate on the test set) and provides a systematic tool for evaluating generalisation in imitation learning.

Real-World Reinforcement Learning of Active Perception Behaviors

This paper proposes Asymmetric Advantage-Weighted Regression (AAWR), which leverages additional privileged sensors during training to estimate more accurate advantage functions, enabling efficient learning of active perception policies in the real world. AAWR outperforms all baselines across 8 manipulation tasks spanning varying degrees of partial observability.

Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

This work releases Reasoning Gym, a library of 100+ procedurally generated reasoning tasks spanning algebra, arithmetic, algorithms, logic, geometry, graph theory, games, and more. Each task supports infinite data generation and parameterized difficulty control. Experiments demonstrate that RLVR training achieves significant skill transfer both within and across domains, and improves performance on external benchmarks such as MATH and GSM8K.

Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

This paper demonstrates that a 14B-parameter search agent trained with RL can surpass frontier models on legal document retrieval (85% vs. GPT o3's 81%) through multi-turn interaction, enabled by a carefully designed segmented reward structure and a sufficiently long interaction horizon.

Reinforcement Learning Teachers of Test Time Scaling

This paper proposes the Reinforcement Learning Teacher (RLT) framework, which provides both the problem and the answer to a teacher model and trains it to generate effective explanatory reasoning chains rather than solving problems from scratch. This enables a 7B-parameter teacher to produce distillation data superior to that generated by models orders of magnitude larger.

Reinforcement Learning with Action Chunking

This paper proposes Q-chunking, which extends action chunking from imitation learning to TD-based reinforcement learning by running RL directly over a "chunked" action space, thereby improving exploration and sample efficiency in long-horizon sparse-reward tasks.
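
A sketch of the h-step backup implied by TD learning over chunked actions (tensor shapes and names assumed):

```python
import torch

# Sketch: the critic scores whole action chunks, so a single TD backup spans
# the h steps executed by one chunk: discounted reward sum + gamma^h bootstrap.
def chunked_td_target(rewards, next_chunk_q, gamma, h):
    # rewards: (B, h) rewards collected while executing one action chunk
    discounts = gamma ** torch.arange(h, dtype=rewards.dtype)
    return (rewards * discounts).sum(dim=1) + (gamma ** h) * next_chunk_q
```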

RePIC: Reinforced Post-Training for Personalizing Multi-Modal Language Models

This paper proposes RePIC, the first reinforcement learning-based post-training framework for multimodal large language models targeting personalized image captioning, which significantly outperforms SFT-based methods in multi-concept scenarios.

Retrosynthesis Planning via Worst-path Policy Optimisation in Tree-structured MDPs

This work reformulates retrosynthesis planning as a worst-path optimisation problem in tree-structured MDPs — the value of a synthesis tree is determined by its weakest path, since any dead-end path renders the entire tree invalid. The proposed method, InterRetro, optimises this worst-path objective via weighted self-imitation learning, achieving 100% success rate on Retro*-190, reducing path length by 4.9%, and attaining 92% of full performance with only 10% of training data.

Reward-Aware Proto-Representations in Reinforcement Learning

This paper systematically develops the theoretical foundations of the Default Representation (DR)—deriving DP and TD learning algorithms, analyzing the feature space structure, and proposing default features for function approximation—and demonstrates DR's reward-aware advantages over the Successor Representation (SR) across four settings: reward shaping, option discovery, exploration, and transfer learning.

Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

This paper proposes a reward-based risk-aware constrained RL framework that applies Optimized Certainty Equivalent (OCE) risk measures to both objectives and constraints, establishes parametric strong duality, and delivers a modular algorithm that wraps standard RL solvers (e.g., PPO) as a black box.

Risk-Averse Total-Reward Reinforcement Learning

This paper proposes risk-averse Q-learning algorithms (ERM-TRC and EVaR-TRC) for the undiscounted total-reward criterion (TRC). By exploiting the elicitability of ERM, the Bellman operator is reformulated as a stochastic gradient descent objective, and convergence guarantees are established.
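
The elicitability step can be sketched directly: since \(\mathrm{ERM}_\beta(X) = -\tfrac{1}{\beta}\log\mathbb{E}[e^{-\beta X}]\) minimizes \(q \mapsto \mathbb{E}[\tfrac{1}{\beta}e^{-\beta(X-q)}] - q\), it can be tracked by SGD on sampled returns (a sketch; the step size and names are mine):

```python
import numpy as np

# Sketch: one stochastic-gradient step toward ERM_beta of the return
# distribution, using the elicitability property described above.
def erm_sgd_step(q, sampled_return, beta, lr=0.01):
    grad = np.exp(-beta * (sampled_return - q)) - 1.0   # d/dq of the elicitation loss
    return q - lr * grad
```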

RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning

Tango proposes a framework that alternately trains a generator and a verifier via RL — the verifier is a generative process-level LLM that evaluates reasoning step by step in natural language, trained solely with outcome-level correctness rewards (no step-level annotations), and mutually reinforced through co-evolution with the generator. On 7B/8B-scale models, Tango achieves SOTA, with a 100% relative improvement over vanilla GRPO on AIME 2025.

Robot-R1: Reinforcement Learning for Enhanced Embodied Reasoning in Robotics

Robot-R1 proposes training large vision-language models (LVLMs) via reinforcement learning (GRPO) for embodied reasoning. By casting next keystate prediction as multiple-choice questions and optimizing reasoning paths with RL, a 7B-parameter model surpasses GPT-4o on low-level control reasoning tasks.

Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling

This paper proposes CART (Conservative Adversarially Robust Decision Transformer), the first method to enhance the adversarial robustness of Decision Transformers in stochastic games. By modeling stage games and estimating NashQ values, CART addresses the over-optimism of ARDT under stochastic state transitions, achieving more accurate minimax value estimation and superior worst-case returns.

Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

This paper proposes the Rationality-Preserving Optimization (RPO) framework and the Rational Policy Gradient (RPG) algorithm. By introducing manipulator agents and opponent shaping techniques, RPG eliminates suicidal behavior induced by adversarial optimization in both cooperative and general-sum games, while simultaneously achieving policy robustness and diversity.

RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning

This paper proposes RoiRL, a lightweight self-supervised reasoning framework based on offline iterative reinforcement learning. By replacing online RL (e.g., TTRL) with a weighted log-likelihood objective, RoiRL enables self-improvement of LLM reasoning capabilities without requiring a reference model or ground-truth labels, achieving 2.5× faster training with superior performance.
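
One natural form of such a weighted log-likelihood objective, where the weighting function is an illustrative assumption (e.g., agreement of a sampled answer with the majority vote over other samples) rather than the paper's exact choice:

\[
\max_\theta \; \mathbb{E}_{x}\, \mathbb{E}_{y \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid x)}\big[\, w(x, y)\, \log \pi_\theta(y \mid x) \,\big],
\]

so that each offline iteration is a weighted maximum-likelihood fit to the previous policy's samples rather than an online RL update.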

Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning

Router-R1 frames multi-LLM routing and aggregation as a sequential decision-making process, employing an LLM itself as the router to interleave think and route actions. Trained via PPO with a triple reward covering format, correctness, and cost, Router-R1 outperforms all router baselines across 7 QA benchmarks and generalizes to previously unseen LLMs.

Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning

This paper proposes the RTZ-VI-LCB algorithm for offline robust two-player zero-sum Markov games (RTZMGs). By combining pessimistic robust value iteration with Bernstein-style penalties, it achieves a near-optimal sample complexity of \(O(C_r^* \cdot H^4 \cdot S \cdot (A+B) / \varepsilon^2)\), significantly improving upon the prior best result of \(O(H^5 \cdot S^2 \cdot AB / \varepsilon^2)\) in terms of dependence on both the state space and the action space.

Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning

This paper establishes the first finite-sample convergence guarantees for distributionally robust average-reward MDPs (DR-AMDPs), proposing two algorithms (discount reduction and anchoring) that achieve near-optimal sample complexity of \(\widetilde{O}(|S||A|t_{\mathrm{mix}}^2\varepsilon^{-2})\) under both KL and \(f_k\)-divergence uncertainty sets.

Scalable Neural Incentive Design with Parameterized Mean-Field Approximation

This paper proposes the AMID algorithm, which formalizes the multi-agent incentive design (ID) problem as a parameterized mean-field game (PMFG), proves that the finite-\(N\)-agent objective approximates the infinite-population limit at a rate of \(\mathscr{O}(1/\sqrt{N})\), and achieves substantial revenue improvements across multiple auction settings.

Scalable Policy-Based RL Algorithms for POMDPs

This paper proposes approximating POMDPs as finite-state Superstate MDPs (where states are truncated histories), derives a tighter upper bound on the optimal value function gap (decaying exponentially with history length), and provides the first finite-time convergence guarantees for standard TD learning combined with policy optimization under non-Markovian sampling.

Self-Improving Embodied Foundation Models

This paper proposes a two-stage post-training framework for embodied foundation models: Stage 1 performs supervised fine-tuning via behavior cloning and steps-to-go prediction; Stage 2 leverages the resulting self-reward function and success detector for online RL self-improvement. Using only 1–3% additional data, the method achieves over 1.5× improvement in success rate and, for the first time, demonstrates a robot autonomously acquiring novel skills beyond the distribution of imitation data.

Sequential Monte Carlo for Policy Optimization in Continuous POMDPs

This paper proposes a nested Sequential Monte Carlo (SMC) algorithm grounded in non-Markovian Feynman-Kac models for policy optimization in continuous POMDPs, naturally capturing the value of information gathering without hand-crafted heuristics.

Sequential Multi-Agent Dynamic Algorithm Configuration

This paper proposes Seq-MADAC, a framework that models multi-hyperparameter dynamic configuration as a contextual sequential multi-agent MDP. By exploiting inherent inter-parameter dependencies via a Sequential Advantage Decomposition Network (SADN), it outperforms existing MARL methods on multi-objective optimization algorithm configuration.

Shift Before You Learn: Enabling Low-Rank Representations in Reinforcement Learning

This paper reveals that the successor measure in reinforcement learning is not inherently approximately low-rank; however, a "shifted successor measure" — obtained by skipping the first few transition steps — naturally exhibits low-rank structure. A novel Type II Poincaré inequality is introduced to quantify the required shift, providing finite-sample theoretical guarantees and practical improvements for goal-conditioned RL.
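
With the standard (unnormalized) successor measure, the shift simply drops the first \(k\) transition steps; a sketch consistent with the description above, up to the paper's choice of normalization:

\[
M^\pi(s, B) = \sum_{t \ge 0} \gamma^{t}\, \Pr\nolimits^{\pi}(s_t \in B \mid s_0 = s),
\qquad
M^{\pi}_{k}(s, B) = \sum_{t \ge k} \gamma^{t-k}\, \Pr\nolimits^{\pi}(s_t \in B \mid s_0 = s),
\]

and the paper's claim is that \(M^{\pi}_{k}\) becomes approximately low-rank for a modest shift \(k\), with the required \(k\) quantified by the Type II Poincaré inequality.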

Simultaneous Swap Regret Minimization via KL-Calibration

This paper introduces KL-Calibration as a stronger calibration measure, establishes its equivalence to the swap regret of log loss, and achieves a simultaneous swap regret bound of \(\tilde{\mathcal{O}}(T^{1/3})\) via non-uniform discretization and a novel randomized rounding scheme, covering a broader class of proper losses than prior work.

Solving Continuous Mean Field Games: Deep Reinforcement Learning for Non-Stationary Dynamics

This paper proposes the DEDA-FP algorithm, which for the first time simultaneously learns Nash equilibrium policies and population distributions in non-stationary mean field games (MFGs) with continuous state/action spaces. By combining deep RL for best-response computation, supervised learning for mean policy representation, and a conditional Normalizing Flow for modeling time-varying population distributions, DEDA-FP achieves over 10× greater sample efficiency than existing methods.

Solving Neural Min-Max Games: The Role of Architecture, Initialization & Dynamics

This paper provides the first convergence guarantees for zero-sum games parameterized by two-layer neural networks, proving that under sufficient overparameterization, random Gaussian initialization, and alternating gradient descent-ascent (AltGDA), the dynamics converge to an \(\epsilon\)-approximate Nash equilibrium with high probability.

Spatial-Aware Decision-Making with Ring Attractors in Reinforcement Learning Systems

This paper integrates ring attractor models from neuroscience into action selection in deep reinforcement learning (DRL). By mapping actions to spatial positions on a ring and injecting Gaussian signals encoding Q-values and uncertainty, the proposed approach achieves a 53% improvement over baselines on Atari 100K.
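
A minimal sketch of the ring readout described above; the constants and the exact injection scheme are illustrative, not the paper's network:

```python
import numpy as np

def ring_attractor_select(q_values, uncertainties, n_units=128, base_width=0.5):
    """Sketch: each action gets a preferred angle on a ring of units; its
    Q-value injects a Gaussian bump whose width grows with the action's
    uncertainty, and the selected action is read out at the activity peak."""
    K = len(q_values)
    unit_angles = np.linspace(0, 2 * np.pi, n_units, endpoint=False)
    action_angles = np.linspace(0, 2 * np.pi, K, endpoint=False)
    activity = np.zeros(n_units)
    for q, u, theta in zip(q_values, uncertainties, action_angles):
        d = np.angle(np.exp(1j * (unit_angles - theta)))   # wrapped distance
        activity += q * np.exp(-d**2 / (2 * (base_width + u) ** 2))
    peak = unit_angles[np.argmax(activity)]
    wrapped = np.angle(np.exp(1j * (action_angles - peak)))
    return int(np.argmin(np.abs(wrapped)))                 # nearest action to peak

# Spatially smoothed, uncertainty-weighted action selection:
print(ring_attractor_select([0.1, 0.9, 0.8], [0.1, 0.8, 0.1]))
```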

STAIR: Addressing Stage Misalignment through Temporal-Aligned Preference Reinforcement Learning

This paper identifies and formalizes the "stage misalignment" problem in Preference-based Reinforcement Learning (PbRL)—wherein comparing behavior segments from different task stages produces uninformative feedback—and proposes STAIR, a method that learns temporal distances via contrastive learning to approximate stage discrepancy. By employing a quadrilateral distance metric for stage-aligned query selection, STAIR substantially outperforms existing PbRL methods on multi-stage tasks.

Strategic Costs of Perceived Bias in Fair Selection

This paper employs a game-theoretic model to reveal a "perception-driven bias" mechanism: in purely merit-based selection systems, inter-group differences in perceived post-selection value lead to rational effort disparities, thereby systematically propagating inequality within ostensibly fair processes.

Structural Information-based Hierarchical Diffusion for Offline Reinforcement Learning

This paper proposes SIHD, a framework that leverages structural information (structural entropy) extracted from historical trajectories to adaptively construct multi-scale diffusion hierarchies, replaces local reward prediction with structural information gain as the conditional guidance signal, and introduces structural entropy regularization to encourage exploration of sparse states in offline data. SIHD achieves up to 12.6% improvement in decision-making performance on the D4RL benchmark.

Structured Reinforcement Learning for Combinatorial Decision-Making

This paper proposes Structured Reinforcement Learning (SRL), which embeds a combinatorial optimization solver as a differentiable layer within the actor of an actor-critic framework. End-to-end gradient propagation is achieved via Fenchel-Young loss with Gaussian perturbations, enabling purely online learning without expert demonstrations. SRL matches imitation learning and outperforms unstructured RL by up to 92% across six industrial-scale combinatorial decision-making problems.
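
A sketch of the Gaussian-perturbation trick that lets a discrete solver sit inside a differentiable actor: perturbing the score vector and averaging the solver's outputs yields a smoothed solution, and in the Fenchel-Young setup the loss gradient is built from exactly such a perturbed expectation. The estimator below is illustrative rather than the paper's exact layer:

```python
import numpy as np

def perturbed_solver_mean(scores, solver, n_samples=64, sigma=1.0, seed=0):
    """Monte-Carlo smoothing of a combinatorial solver: average its
    solutions under Gaussian perturbations of the score vector."""
    rng = np.random.default_rng(seed)
    sols = [solver(scores + sigma * rng.standard_normal(scores.shape))
            for _ in range(n_samples)]
    return np.mean(sols, axis=0)  # Fenchel-Young gradient uses
                                  # (smoothed solution - target solution)

# Toy "solver": top-1 selection over items, encoded one-hot.
top1 = lambda s: np.eye(len(s))[np.argmax(s)]
print(perturbed_solver_mean(np.array([1.0, 1.1, 0.2]), top1))
```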

Succeed or Learn Slowly: Sample Efficient Off-Policy Reinforcement Learning for Mobile App Control

This paper proposes the SoLS algorithm, which achieves sample-efficient RL fine-tuning of foundation models for mobile app control through an asymmetric policy update mechanism (aggressive learning on success, conservative regularization on failure) combined with Success Transition Replay (STR), attaining a 51.3% success rate on AndroidWorld.
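
A sketch of what an asymmetric update of this kind can look like; the coefficients and the exact form of the failure-side regularizer are illustrative assumptions, not the paper's losses:

```python
def asymmetric_policy_loss(logp, logp_ref, advantage, success,
                           fail_scale=0.1, kl_coef=0.1):
    """Sketch: learn aggressively from successful transitions; on failures,
    shrink the policy-gradient step and regularize toward a reference policy."""
    pg = -advantage * logp                 # standard policy-gradient term
    if success:
        return pg                          # aggressive update on success
    kl = logp - logp_ref                   # naive per-sample KL estimate
    return fail_scale * pg + kl_coef * kl  # conservative, regularized update

print(asymmetric_policy_loss(logp=-1.2, logp_ref=-1.0, advantage=0.5, success=False))
```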

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

This work is the first to apply reinforcement learning (RL) to real-world software engineering tasks (GitHub PR/Issue resolution), training Llama-3.3-70B exclusively with a rule-based sequence-similarity reward. It achieves a 41.0% resolve rate on SWE-bench Verified (SOTA among medium-scale models). Notably, although RL training is conducted solely on issue-solving data, it elicits emergent generalization in out-of-domain tasks including code reasoning, mathematics, and general language understanding.

Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

This paper formulates personalized dialogue alignment as a multi-turn Markov Decision Process and proposes the RLPA framework, enabling LLMs to dynamically infer and maintain user profiles through online interaction with simulated users, and to generate personalized responses accordingly.

Temporal-Difference Variational Continual Learning

This paper proposes the TD-VCL objective, which reformulates the learning target in Variational Continual Learning (VCL) as a weighted combination of multiple past posterior estimates. This reformulation reveals a deep connection to temporal-difference (TD) methods in reinforcement learning, and effectively mitigates the progressive accumulation of approximation errors by "spreading" regularization pressure across multiple historical posteriors.

TensorRL-QAS: Reinforcement Learning with Tensor Networks for Improved Quantum Architecture Search

This work proposes TensorRL-QAS, a framework that warm-starts reinforcement learning-based quantum architecture search (RL-QAS) using tensor networks (MPS/DMRG), achieving up to 10× reduction in circuit depth and CNOT gate count, and up to 98% acceleration in training time, thereby effectively addressing the scalability bottleneck of RL-QAS on large-scale quantum systems.

The Burden of Interactive Alignment with Inconsistent Preferences

This paper models user interactions with engagement-driven algorithms as a multi-leader single-follower Stackelberg game, establishing a critical planning-horizon threshold: users whose effective horizon exceeds this threshold can align the algorithm to their interests, while those below it are instead aligned to the algorithm's objectives. The paper further demonstrates that introducing low-cost signals (e.g., an extra click) can substantially reduce the burden of alignment.

The Path Not Taken: RLVR Provably Learns Off the Principals

This paper proposes the Three-Gate Theory to explain the apparent sparsity of parameter updates in RLVR, demonstrating that RLVR learns along off-principal directions in weight space — a fundamentally different optimization mechanism from SFT — and that directly transplanting SFT-era PEFT methods to RLVR is therefore flawed.

The Physical Basis of Prediction: World Model Formation in Neural Organoids via an LLM-Generated Curriculum

This paper proposes a framework for studying world model formation in human neural organoids, comprising three progressively complex virtual environments (conditioned avoidance, predator–prey, Pong) and a meta-learning approach in which an LLM automatically generates experimental protocols, complemented by a multi-scale biophysical evaluation strategy to quantify the physical basis of biological learning.

The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis

This paper formalizes the Big World Hypothesis from a computationally-embedded perspective, proves that agents embedded in universal-local environments are inherently capacity-constrained, proposes interactivity as a computational measure of continual adaptability, and empirically demonstrates that deep nonlinear networks fail to maintain interactivity while deep linear networks improve interactivity as capacity increases.

Thompson Sampling for Multi-Objective Linear Contextual Bandit

This paper proposes MOL-TS—the first multi-objective linear contextual bandit Thompson Sampling algorithm with worst-case Pareto regret guarantees. By introducing the concept of "effective Pareto optimal arms" and an optimistic sampling strategy, MOL-TS achieves a regret upper bound of \(\widetilde{O}(d^{3/2}\sqrt{T})\), with the number of objectives \(L\) contributing only an \(O(\log L)\) factor.

Thompson Sampling in Function Spaces via Neural Operators

This paper extends Thompson Sampling (TS) from finite-dimensional parameter spaces to infinite-dimensional function spaces, leveraging neural operators as approximate samplers of Gaussian process posteriors to efficiently solve functional optimization problems involving partial differential equations (PDEs).

Time Reversal Symmetry for Efficient Robotic Manipulations in Deep Reinforcement Learning

This paper proposes the TR-DRL framework, which exploits time reversal symmetry in robotic manipulation tasks—via trajectory reversal augmentation (for fully reversible transitions) and time-reversal-guided potential-based reward shaping (for partially reversible transitions)—to significantly improve sample efficiency and final performance of DRL on paired tasks (e.g., door opening/closing).

To Distill or Decide? Understanding the Algorithmic Trade-off in Partially Observable Reinforcement Learning

Using a theoretical framework (perturbed Block MDP) and controlled locomotion experiments, this paper systematically investigates the algorithmic trade-off between privileged expert distillation and standard RL (without privileged information) in partially observable RL, finding that the trade-off is primarily governed by the stochasticity of latent state dynamics.

Towards Provable Emergence of In-Context Reinforcement Learning

This paper theoretically proves that the globally optimal parameters of a Transformer pretrained via standard RL objectives can implement in-context temporal difference (TD) learning, providing the first provable theoretical foundation for the in-context RL (ICRL) phenomenon.

Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities

This work presents ONL-MNL, the first computationally tractable and statistically optimal algorithm for the MNL contextual bandit problem under non-linear utility functions (including neural networks), achieving \(\widetilde{\mathcal{O}}(\sqrt{T})\) regret without relying on NTK assumptions.

Training Language Models to Reason Efficiently

By incorporating a length penalty term into the RL reward—multiplying the correctness reward by \((1 - \alpha \cdot \sigma(\text{norm\_len}))\)—and using a single hyperparameter \(\alpha\) to control the token–accuracy trade-off curve, this work achieves a 50% reduction in token usage with less than 5% accuracy degradation on 7B reasoning models after only 100 RL training steps.
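
A direct transcription of the stated reward into code; the length normalization below is an illustrative choice, and only the multiplicative form \((1 - \alpha \cdot \sigma(\text{norm\_len}))\) comes from the note above:

```python
import math

def length_penalized_reward(correct, resp_len, max_len, alpha=0.2):
    """Correctness reward scaled down smoothly as responses get longer."""
    norm_len = 2.0 * resp_len / max_len - 1.0       # map length to about [-1, 1]
    sigmoid = 1.0 / (1.0 + math.exp(-norm_len))
    return float(correct) * (1.0 - alpha * sigmoid)

# Same correct answer, shorter response => larger reward:
print(length_penalized_reward(True, 400, 4096), length_penalized_reward(True, 3800, 4096))
```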

TRiCo: Triadic Game-Theoretic Co-Training for Robust Semi-Supervised Learning

This paper proposes TRiCo, a framework that reformulates semi-supervised learning as a three-player Stackelberg game among a teacher, two student classifiers, and an adversarial generator. It replaces confidence-based thresholding with mutual information for pseudo-label selection and employs a meta-learning teacher to adaptively regulate training dynamics, achieving state-of-the-art performance under low-label regimes.

Trust Region Reward Optimization and Proximal Inverse Reward Optimization Algorithm

This paper proposes the TRRO theoretical framework and the PIRO practical algorithm, which guarantee monotonic improvement of reward-function updates in inverse reinforcement learning (IRL) via a Minorization-Maximization procedure, achieving stability guarantees analogous to those of TRPO/PPO in forward RL.

Variance-Aware Feel-Good Thompson Sampling for Contextual Bandits

This paper proposes FGTS-VA, the first variance-aware contextual bandit algorithm based on Feel-Good Thompson Sampling (FGTS). The resulting regret bound is optimal in the model dimension \(d\), matching the best variance-dependent regret bounds established by UCB-based methods.

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

This paper introduces VIKI-Bench, the first hierarchical benchmark for embodied multi-agent cooperation, comprising three evaluation levels—agent activation, task planning, and trajectory perception—and proposes VIKI-R, a two-stage training framework combining CoT-supervised fine-tuning with multi-level reward reinforcement learning. The framework achieves significant improvements over baselines across diverse robot morphologies and multi-view visual observations, with combinatorial coordination patterns emerging during the RL stage.

VolleyBots: A Testbed for Multi-Drone Volleyball Game Combining Motion Control and Strategic Play

This paper presents VolleyBots, a multi-drone volleyball competition testbed that integrates cooperative-adversarial gameplay, turn-based interaction, and agile 3D motion control. Built on Isaac Sim, it establishes a task curriculum from single-agent training to multi-agent competition. A hierarchical policy achieves a 69.5% win rate on the 3v3 task, with demonstrated zero-shot sim-to-real transfer.

When Can Model-Free Reinforcement Learning be Enough for Thinking?

This paper proposes the Thought MDP formalism to characterize the conditions under which "thinking" behavior emerges under model-free RL: policy initialization is the decisive factor; thinking actions are equivalent to the agent performing one step of policy improvement before acting; and open-source LLMs satisfy the necessary conditions for thinking to emerge.

When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

Inspired by cognitive neuroscience (the relative independence of reasoning and language processing in the human brain), this work identifies and removes language-specific components in the activation space of LLMs to disentangle language from reasoning, achieving consistent improvements in multilingual reasoning performance without any training.

Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts

This paper proposes the Context-Enhanced Bellman Equation (CEBE) and Context Sample Enhancement (CSE), which leverage first-order derivative information of environment dynamics and reward functions with respect to context parameters to achieve zero-shot generalization to unseen contexts when training is restricted to a single context.
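
A sketch of the first-order information such a construction exploits, expanding the dynamics and reward around the training context \(c_0\) (the exact form of the context-enhanced backup in the paper may differ):

\[
P_c(s' \mid s, a) \approx P_{c_0}(s' \mid s, a) + (c - c_0)^{\top} \nabla_c P_c(s' \mid s, a)\big|_{c = c_0},
\qquad
r_c(s, a) \approx r_{c_0}(s, a) + (c - c_0)^{\top} \nabla_c r_c(s, a)\big|_{c = c_0},
\]

so that Bellman backups for unseen contexts \(c\) can be approximated from quantities measured at \(c_0\) alone.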

Zeroth-Order Optimization Finds Flat Minima

This paper provides the first theoretical proof that standard zeroth-order optimization (two-point gradient estimation) exhibits an implicit regularization effect—converging to flat minima that minimize the Hessian trace—with a convergence complexity of \(T = \mathcal{O}(d^4/\epsilon^2)\) established under convexity and sufficient smoothness conditions.