🎮 Reinforcement Learning
🔬 ICLR2026 · 400 paper notes
📌 Same area in other venues: 📷 CVPR2026 (23) · 💬 ACL2026 (46) · 🧪 ICML2026 (110) · 🤖 AAAI2026 (58) · 🧠 NeurIPS2025 (140) · 📹 ICCV2025 (7)
🔥 Top topics: Reinforcement Learning ×154 · Reasoning ×39 · LLM ×32 · Adversarial Robustness ×30 · Agents ×30
- 3D-aware Disentangled Representation for Compositional Reinforcement Learning
-
This work extends the structured decomposition of "object attributes \(\rightarrow\) discrete blocks" from 2D to 3D multi-view space. By utilizing a policy network with block-level cross-attention for goal-conditioned reinforcement learning, it enables a robot to stably push objects to target positions even under unseen attribute combinations and novel viewpoints.
- A\(^2\)Search: Ambiguity-Aware Question Answering with Reinforcement Learning
-
A\(^2\)Search proposes an annotation-free automatic pipeline to mine multiple valid answers for "ambiguous questions" from existing QA data. By employing a multi-answer friendly AnsF1 reward for GRPO reinforcement learning, a 7B model outperforms strong 32B baselines in multi-hop QA with only a single rollout.
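As a rough illustration of a multi-answer-friendly reward, the sketch below computes an F1 score between the set of answers a rollout produces and the set of mined valid answers. The function names (`normalize`, `ans_f1`) and the matching rule (exact match after lowercasing and whitespace normalization) are assumptions made for this example; the summary above does not specify how AnsF1 matches answers.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (assumed matching rule)."""
    return " ".join(answer.lower().split())


def ans_f1(predicted, references):
    """F1 between a set of predicted answers and a set of valid answers.

    Hypothetical stand-in for an AnsF1-style reward: precision penalizes
    spurious answers, recall rewards covering every valid answer.
    """
    pred = {normalize(a) for a in predicted}
    gold = {normalize(a) for a in references}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a rollout that returns one correct and one spurious answer for a single-answer question scores 2/3, so over-answering is penalized without being zeroed out.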
- A Hierarchical Circuit Symbolic Discovery Framework for Efficient Logic Optimization
-
HIS utilizes a "hierarchical symbolic tree" to distill the layer-wise message passing of GNNs into a lightweight, interpretable symbolic scoring function. It "generates" this tree end-to-end using a structure-aware Transformer and group-advantage PPO to accurately and rapidly identify invalid transformations in logic optimization (LO) for chip design. Compared to state-of-the-art (SOTA) GNN inference, it is approximately 296× faster. When integrated into the Mfs2 heuristic, it reduces average runtime by 27.22% while further reducing circuit size by 6.95%.
- A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning
-
This paper introduces the Forward-Backward (FB) framework from Reward-Free Reinforcement Learning (RFRL) into Multi-Objective Reinforcement Learning (MORL) for the first time. It proposes MORL-FB, which utilizes preference-guided exploration to construct latent vectors \(z\) relevant to MORL tasks and incorporates an auxiliary Q-loss. This approach enables a preference-conditioned policy to significantly outperform SOTA methods like PD-MORL and Q-Pensieve on MO-Gymnasium with higher sample efficiency.
- A Unifying View of Coverage in Linear Off-Policy Evaluation
-
This paper proposes a new coverage parameter, feature-dynamics coverage, and provides a novel finite-sample analysis of the classic LSTDQ algorithm from an instrumental-variable perspective, unifying the fragmented coverage definitions used in linear off-policy evaluation.
- AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
-
The authors propose AbstRaL, which utilizes Reinforcement Learning (RL) to teach LLMs mathematical abstraction—replacing specific numbers/names with symbolic variables and extracting general formulas. These abstractions are then processed by a symbolic solver to derive answers. AbstRaL almost entirely eliminates performance degradation caused by distribution shifts on GSM perturbation benchmarks and shows implicit improvements in OOD mathematical and general reasoning tasks.
- Accelerated Learning with Linear Temporal Logic using Differentiable Simulation
-
This paper bridges the gap between Linear Temporal Logic (LTL) specifications and differentiable physics simulators for the first time. By applying "soft-label" relaxation to the discrete transitions of the automaton, the authors derive rewards and state representations that are differentiable with respect to states and actions. This allows first-order gradient algorithms (SHAC/AHAC) to learn efficiently directly from formal specifications, doubling the training speed and returns compared to discrete baselines on contact-rich continuous control tasks.
- Accelerating Diffusion Planners in Offline RL via Reward-Aware Consistency Trajectory Distillation
-
RACTD integrates reward optimization objectives directly into the consistency trajectory distillation process. Using a pretrained diffusion teacher planner and an independently trained noise-free reward model, it distills a single-step sampling student planner. It outperforms the previous SOTA by 9.7% on average in D4RL while being up to 142 times faster in inference than the diffusion teacher.
- Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
-
Ada-Diffuser explicitly incorporates "time-evolving hidden contexts (wind, goals, skills)" into diffusion-based decision models. It theoretically demonstrates that latent variables can be identified using a small temporal block of only 4 adjacent observations. By employing a "denoising-refinement" mechanism and zig-zag sampling, the model performs online latent inference and planning/control, consistently outperforming existing diffusion planners and latent context baselines across 23 settings in 8 environments.
- Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning
-
Policy constraint strength in offline RL (the balance between the RL objective and Behavior Cloning) usually has to be tuned by hand for each dataset. This paper proposes ASPC, which treats the scaling factor \(\alpha\) in TD3+BC as a learnable parameter and adjusts it dynamically during training through second-order differentiable bilevel optimization, stabilized by constraining the rates of change of the Q-values and the BC loss. With a single set of hyperparameters across 39 D4RL datasets, the method outperforms state-of-the-art (SOTA) approaches that require per-dataset grid searches and achieves a 35% average improvement over the baseline.
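For context on where \(\alpha\) enters, here is a minimal NumPy sketch of the standard TD3+BC actor objective, in which \(\alpha\) sets the weight of the Q-term relative to the behavior-cloning term. It only evaluates the loss for a fixed \(\alpha\); the bilevel update that ASPC uses to learn \(\alpha\) is not shown, and the function name `td3_bc_actor_loss` is chosen for this example.

```python
import numpy as np


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha):
    """Standard TD3+BC actor loss for a batch, with alpha held fixed.

    loss = -lambda * mean(Q(s, pi(s))) + mean((pi(s) - a)^2),
    where lambda = alpha / mean(|Q|) normalizes the scale of the Q-term.
    """
    q = np.asarray(q_values, dtype=float)
    pi_a = np.asarray(policy_actions, dtype=float)
    data_a = np.asarray(dataset_actions, dtype=float)
    lam = alpha / (np.abs(q).mean() + 1e-8)   # small epsilon avoids division by zero
    bc_term = np.mean((pi_a - data_a) ** 2)   # behavior-cloning penalty
    return -lam * q.mean() + bc_term
```

A larger \(\alpha\) puts more weight on maximizing Q and less on staying close to the dataset actions, which is why a single fixed value rarely suits every dataset.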
- ADM-v2: Pursuing Full-Horizon Roll-out in Dynamics Models for Offline Policy Learning and Evaluation
-
ADM-v2 structurally decouples the starting state of the "Any-step Dynamics Model" from the GRU loop. Combined with the parallel any-step roll-out algorithm PARoll, it enables dynamics models to reliably execute full-horizon roll-outs, achieving SOTA results in both offline policy evaluation (OPE) and offline policy optimization on D4RL and NeoRL.
- All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
-
This paper explains why the "two-stage reward model + online RL" approach in language model fine-tuning often outperforms direct offline maximum likelihood from the perspectives of information geometry, controlled experiments, and complexity intuition. The core conclusion is that the value of RL lies not in creating new information, but in using an easier-to-learn verifier to constrain policy search to a small class of generators induced by simple rewards.
- AlphaSAGE: Structure-Aware Alpha Mining via GFlowNets for Robust Exploration
-
AlphaSAGE reformulates formulaic alpha mining in quantitative stock selection from "Reinforcement Learning maximizing expected return" to "Generative Flow Networks (GFlowNets) sampling proportional to rewards." By incorporating an RGCN structural encoder and multifaceted dense rewards, the method discovers a collection of alpha factors that are simultaneously predictive, low-correlated, and structurally novel, significantly outperforming existing RL/GA/LLM baselines across Chinese and US stock markets.
- AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
-
During skill pre-training, AMPED uses gradient surgery (PCGrad) to resolve gradient conflicts between the exploration objective (Entropy + RND) and the skill-diversity objective (AnInfoNCE). During fine-tuning, a SAC-based skill selector adaptively chooses the best skill. The method outperforms SBRL baselines such as DIAYN, CeSD, and CIC on the Maze and URLB benchmarks.
- Analysis of Approximate Linear Programming Solution to Markov Decision Problem with Log Barrier Function
-
This paper uses a log-barrier function to rewrite the Linear Programming (LP) formulation of MDPs from an inequality-constrained problem into an unconstrained strongly convex objective \(f_\eta\). The authors prove an error bound for the approximate optimal Q-function that is linear in the barrier parameter \(\eta\), show exponential convergence for gradient descent, and design Log-barrier DQN / DDPG variants that eliminate the need for target networks.
- APC-RL: Exceeding Data-Driven Behavior Priors with Adaptive Policy Composition
-
APC employs a "learning-free arbitrator selector" to adaptively switch between multiple Normalizing Flow data priors and a prior-free actor. This approach accelerates learning when demonstrations are aligned and bypasses priors when they are suboptimal or misaligned, thereby "exceeding" the performance upper bound of the demonstration data itself.
- ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
-
ARM-FM uses foundation models (e.g., GPT-4o) to automatically generate Language-Aligned Reward Machines (LARM) from natural language task descriptions. Each LARM comprises an automaton structure, executable label functions, and a natural language description of each state. The resulting compositional dense reward signals let RL agents solve sparse-reward, long-horizon tasks in environments such as MiniGrid, Craftium (3D Minecraft), and Meta-World where standard RL fails, while also achieving zero-shot task generalization.
- Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning
-
This paper adapts normalized implicit gradient transport (NIGT) into an asynchronously aggregatable distributed policy gradient algorithm. It proposes Rennala NIGT for homogeneous environments and Malenia NIGT for heterogeneous environments. Both theoretical complexity and MuJoCo experiments demonstrate that these methods better utilize fast workers, handle slow communication, and manage heterogeneous environments compared to AFedPG.
- Automating the Refinement of Reinforcement Learning Specifications
-
The AUTOSPEC framework diagnoses "unlearnable policies due to coarse logic specifications" as failures on specific edges of an abstraction graph. It automatically refines specifications using four soundness-preserving operations (modifying predicates, inserting landmarks, splitting start regions, and finding alternative paths), enabling existing specification-guided RL algorithms to solve previously unsolvable tasks.
- AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization
-
AutoQD embeds policy occupancy measures into a finite-dimensional space via Random Fourier Features (RFF), then applies weighted PCA for dimensionality reduction to obtain behavior descriptors (BDs). This enables QD optimization without manual BD design and consistently outperforms manual BDs and existing unsupervised QD methods across six continuous control tasks.
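The embedding step can be sketched with the classic Random Fourier Features construction for a Gaussian kernel: averaging the features of visited state-action pairs gives a finite-dimensional summary of a policy's occupancy measure. This is a minimal sketch under stated assumptions, not AutoQD's implementation: the names `make_rff` and `policy_embedding`, the Gaussian kernel, the bandwidth, and the unweighted average (no discounting) are all chosen for illustration, and the weighted PCA step is omitted.

```python
import numpy as np


def make_rff(input_dim, num_features, bandwidth=1.0, seed=0):
    """Return a feature map phi approximating a Gaussian kernel.

    phi(x) . phi(y) ~= exp(-||x - y||^2 / (2 * bandwidth^2)).
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(num_features, input_dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)

    def phi(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))
        return np.sqrt(2.0 / num_features) * np.cos(x @ W.T + b)

    return phi


def policy_embedding(phi, state_actions):
    """Average RFF features over visited (state, action) vectors.

    Assumption: an unweighted sample mean stands in for the occupancy measure.
    """
    return phi(state_actions).mean(axis=0)
```

Two policies that visit similar state-action regions end up with nearby embeddings, which is what makes a low-dimensional projection of these vectors usable as a behavior descriptor.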
- AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints
-
This paper proposes a reinforcement learning strategy with Decoupled Adaptive Entropy Constraints, enabling LLMs to automatically switch between long and short reasoning modes based on problem difficulty in tool-calling tasks. It improves accuracy by 9.8% while reducing inference token overhead by approximately 81%.
- AWM: Accurate Weight-Matrix Fingerprint for Large Language Models
-
The paper proposes AWM, a training-free LLM weight matrix fingerprinting method. It utilizes the Linear Assignment Problem (LAP) to recover permutations and sign flips of the embedding layer, followed by unbiased CKA to eliminate the impact of orthogonal transformations on Q/K matrices. It achieves a perfect AUC (1.0) across 150 LLM pairs, remains robust against six types of post-training (SFT, continued pre-training with 5.5T tokens, RL, multi-modal expansion, pruning, and upcycling), and completes within 30 seconds.
- Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards
-
To address the issues of routing collapse and low expert utilization when using GRPO for reinforcement fine-tuning of LoRA-MoE, this paper proposes RO-GRPO. It converts internal routing statistics (entropy + load variance) collected during training into a scalar reward, which is directly integrated into the total GRPO reward. Without auxiliary losses, architectural changes, or additional training stages, this approach improves mathematical reasoning accuracy while making expert routing more balanced and confident.
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
-
BAPO dynamically adjusts the upper and lower clipping boundaries \(c_{high}\) and \(c_{low}\) of PPO/GRPO during training to maintain the contribution of positive samples to the policy gradient loss at a target value \(\rho_0\). This mechanism simultaneously suppresses negative sample dominance and entropy collapse in off-policy RL, ensuring stable and efficient training for 7B/32B reasoning models.
- BA-MCTS: Bayes Adaptive Monte Carlo Tree Search for Offline Model-based RL
-
This work introduces Bayes Adaptive MDP (BAMDP) into offline model-based RL for the first time, proposing Continuous BAMCP to solve Bayesian planning in continuous state/action spaces. By combining pessimistic reward penalties with search-based policy iteration (the "RL + Search" paradigm), it significantly outperforms 19 baselines on 12 D4RL tasks (Cohen's \(d > 1.8\)) and is successfully applied to nuclear fusion tokamak control.
- Bayesian Ensemble for Sequential Decision-Making
-
This paper proposes Bayesian Ensemble, which models the choice of "which ensemble member to select" as an inner bandit with Bayesian updates. By dynamically adjusting the sampling distribution of ensemble members using reward feedback in contextual bandits and DQN, the method significantly reduces regret and enhances cumulative returns in MiniGrid reinforcement learning tasks with negligible overhead compared to standard ensemble+ methods.
- Bayesian Robust Cooperative Multi-Agent Reinforcement Learning Against Unknown Adversaries
-
To address "unknown goal" adversaries in cooperative multi-agent reinforcement learning (c-MARL) deployment, this paper moves beyond learning a single worst-case max–min policy. Instead, it discretizes an infinite variety of adversarial strategies into finite types based on their "disruption severity." Representative worst-case adversaries are learned for each type, and a robust adaptive policy, BATPAL, is trained using a belief network and simultaneous gradient updates. BATPAL consistently outperforms existing SOTA against both seen and unseen attacks across four benchmarks.
- Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
-
DT-CORL utilizes a Transformer belief model to predict the current latent state from delayed observations and historical actions. By embedding this belief representation directly into conservative offline policy iteration, the policy trained on delay-free offline data maintains stable control performance during deployment under both deterministic and stochastic delays.
- Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
-
This paper utilizes path planning on graphs as an analyzable abstraction for language model planning. It theoretically demonstrates that SFT tends to learn co-occurrence memorization, and the advantage of policy gradient primarily stems from exploration but at the cost of output diversity. In contrast, Q-learning with process rewards is shown to potentially maintain correctness, diversity, and off-policy training capabilities simultaneously.
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
-
This paper proposes RLCR (Reinforcement Learning with Calibration Rewards), which overlays a Brier score term on top of standard "binary correctness rewards." This allows reasoning models to output a calibrated confidence while generating answers. Without significant loss in accuracy, it reduces Expected Calibration Error (ECE) from 0.37 to 0.03 on HotpotQA and reverses the degradation trend of standard RL—where models typically become "more confident and more chaotic" as training progresses—on out-of-distribution (OOD) tasks.
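A calibration-aware reward of this kind can be sketched as the binary correctness reward minus the Brier score of the stated confidence. The function name `rlcr_reward`, the unit weighting of the two terms, and the sign convention are assumptions for this example; the summary above only says that a Brier score term is overlaid on the correctness reward.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward with a Brier-score calibration penalty (assumed form).

    reward = y - (confidence - y)^2, where y is 1 for a correct answer, else 0.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier = (confidence - y) ** 2
    return y - brier
```

Under this form a confidently wrong answer (confidence 1.0) receives -1.0, while a wrong answer reported with confidence 0.0 receives 0.0, so the model is pushed to lower its stated confidence when it is likely to be wrong.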
- Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning
-
Addressing the geometric distortion caused by the "unbounded support + tanh squashing" of Gaussian policies in bounded action spaces, this paper proposes GAC (Geometric Action Control). It decomposes action generation into a "unit direction vector on a sphere + a learnable concentration scalar," replacing probabilistic sampling with spherical interpolation. This reduces parameter count from \(2d\) to \(d+1\) and sampling complexity from \(O(dk)\) to \(O(d)\), matching or outperforming SAC across 6 MuJoCo and 6 DMControl tasks (e.g., Ant-v4 +37.6%, quadruped-run +112%).
- Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring
-
To address the classic "Noisy-TV" trap in intrinsic motivation exploration, this paper proposes Learning Progress Monitoring (LPM): using "how much the model improved this round compared to the last" as an intrinsic reward instead of prediction error or novelty. Since unlearnable random transitions yield zero progress, the agent is naturally immune to noise. LPM achieves faster convergence, higher state coverage, and superior extrinsic returns across MNIST, 3D mazes, and Atari compared to SOTA.
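The core idea can be illustrated with a deliberately simple sketch: the intrinsic reward is the drop in a model's expected prediction error between two consecutive training rounds. The function name `learning_progress_reward` and the use of a plain difference of errors are assumptions for illustration; how LPM actually estimates these errors is not described in the summary above.

```python
def learning_progress_reward(prev_error: float, curr_error: float) -> float:
    """Intrinsic reward as the improvement in expected prediction error.

    prev_error: expected error of the model from the previous round.
    curr_error: expected error of the model after the current round.
    """
    return prev_error - curr_error


# Learnable transition: the error shrinks, so the agent is rewarded.
learnable = learning_progress_reward(prev_error=0.8, curr_error=0.5)

# Pure noise: the error is high but does not improve, so the reward is zero,
# whereas a prediction-error bonus would keep paying out.
noisy_tv = learning_progress_reward(prev_error=1.0, curr_error=1.0)
```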
- Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
-
To address the issues of policy entropy collapse and Pass@k stagnation in standard RLVR training, this paper proposes SVS (Self-play with Variational problem Synthesis). In this method, the policy model uses its own correct solutions to difficult problems to "back-synthesize" a set of variant problems with the same answers. Solving these new problems online expands the training data and sustains policy entropy, achieving absolute gains in Pass@32 of 18.3% and 22.8% on AIME24/25, respectively.
- Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning
-
DOSER utilizes two diffusion models to characterize the behavior policy and state distribution, respectively, using single-step denoising reconstruction error as a reliable OOD (Out-of-Distribution) metric. By leveraging a dynamics model, OOD actions are further categorized into "beneficial" and "harmful" types; the former receives a reward bonus while the latter is penalized. This approach suppresses value overestimation without stifling potential exploration in offline RL, achieving state-of-the-art results on D4RL, particularly on sub-optimal datasets.
- Beyond Softmax and Entropy: Convergence Rates of Policy Gradients with \(f\)-SoftArgmax Parameterization & Coupled Regularization
-
By replacing the default "softmax parameterization + entropy regularization" in RL with the coupled duo of "\(f\)-softargmax parameterization + homologous \(f\)-divergence regularization", the authors prove that the coupled objective satisfies the Polyak-Łojasiewicz (PL) inequality. This allows for the first explicit last-iterate convergence guarantee for stochastic policy gradients without preconditioning. Specifically, Tsallis divergence improves the exponential sample complexity of softmax to polynomial complexity.
- Boolean Satisfiability via Imitation Learning
-
ImitSAT is the first imitation-learning-based branching strategy for CDCL solvers. It compresses solver runs into conflict-free KeyTrace expert sequences and models branching decisions as an autoregressive prediction task conditioned on prefixes. This approach significantly reduces the number of propagations and solving time with a small query budget, while demonstrating strong generalization capabilities on structured SAT problems.
- Boosting Multi-Domain Reasoning of LLMs via Curvature-Guided Policy Optimization
-
Addressing the cross-domain conflict in multi-domain RL training for LLMs (e.g., "improving math degrades writing"), CGPO draws on the idea of Newton's method using curvature to precondition gradients. Instead of explicitly computing the Hessian, CGPO splits a batch into domain-specific sub-batches and performs serial updates in a random order. Domains updated later naturally perceive the curvature perturbations left by earlier ones, which in expectation is equivalent to maximizing the inner product of gradients across domains—implicitly aligning cross-domain gradients. On Qwen2.5-3B/7B across four domains and seven benchmarks, the average score consistently outperforms joint training and gradient balancing baselines (7B: 59.59 vs. Joint 56.62) with almost zero additional overhead.
- BoreaRL: A Multi-Objective Reinforcement Learning Environment for Climate-Adaptive Boreal Forest Management
-
BoreaRL is the first multi-objective reinforcement learning (MORL) environment for climate-adaptive boreal forest management. Using a physical simulator coupling energy-carbon-water fluxes, it poses the conflicting goals of "maximizing carbon sequestration vs. protecting permafrost" to MORL agents. The study reveals a severe asymmetry in learning difficulty—the carbon goal is easily mastered while the permafrost goal is nearly unlearnable—and shows that a simple "site-selection" curriculum strategy surprisingly outperforms standard preference-conditioned methods.
- BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
-
BranchGRPO transforms the "independent sequential sampling" of GRPO in diffusion/flow models into a structured branching tree with shared prefixes. This tree structure simultaneously addresses two issues: prefix reuse amortizes sampling costs, and leaf-reward backward fusion provides depth-normalized dense step-level advantages. Combined with width/depth pruning to backpropagate gradients only on valuable subsets, it achieves up to a 16% improvement in HPSv2.1 image alignment compared to DanceGRPO and reduces single-round training time by nearly 55%, with a hybrid variant reaching 4.7× acceleration.
- Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?
-
Through observational studies (18 open-source RPT models) and interventional studies (single-domain GRPO training), this work systematically reveals the generalization limitations of Reinforcement Post Training (RPT/RLVR). While RPT significantly improves performance within the training domain, cross-domain generalization is inconsistent: gains transfer between structured domains (Math ↔ Code) but fail to generalize to unstructured domains (Law/Finance/Medical). This finding remains consistent across different algorithms, model scales, and training steps.
- Breaking Safety Paradox with Feasible Dual Policy Iteration
-
This paper identifies a counter-intuitive "Safety Paradox" in Safe RL: as a policy becomes safer, constraint-violating samples become sparser, causing the estimation of the feasibility function to deteriorate and ultimately undermining safety. The authors propose FDPI, which employs a dedicated "dual policy" to intentionally collect violation samples. Combined with importance sampling and KL constraints, FDPI achieves the lowest violations and near-highest returns on Safety-Gymnasium.
- Bridging Successor Measure and Online Policy Learning with Flow Matching-Based Representations
-
This paper proposes Successor Flow Features (SF2), which approximates the Successor Measure (SM) using flow matching generative models. By forcing the vector field to decompose into a linear structure of "time-invariant state-action embedding \(\psi(s,a)\) + time-varying projection \(\zeta(s',k)\)," the authors bridge SM estimation with online policy optimization. When integrated into TD3/SAC across seven DeepMind Control continuous control tasks, SF2 demonstrates superior sample efficiency and training stability compared to strong successor feature baselines.
- Bridging the Performance-Gap Between Target-Free and Target-Based Reinforcement Learning
-
By using an old copy of the last linear head of the online network as the target network—while sharing all other parameters—and integrating iterated Q-learning to learn multi-step Bellman iterations in parallel, this method closes the performance gap between target-free and target-based RL with almost no additional memory overhead.
- Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
-
Addressing two major wastes in online (on-policy) RLVR training—"inability to learn from hard samples" and "sampled data discarded after one use"—this paper proposes the off-policy framework BAPO (Batch Adaptation Policy Optimization). It utilizes a "difficulty-aware experience replay + adaptive batch construction" mechanism to bring historical hard problems and high-quality trajectories back into training batches. It theoretically proves that the adapted batches still satisfy the lower bound for policy improvement. Ultimately, BAPO achieves an average 12.5% improvement over GRPO across math, planning, and visual geometry tasks, solving 40.7% of hard problems that the base model consistently failed.
- CaRe-BN: Precise Moving Statistics for Stabilizing Spiking Neural Networks in Reinforcement Learning
-
To address the training instability in Spiking Neural Networks (SNNs) caused by inaccurate Batch Normalization (BN) moving statistics in online Reinforcement Learning (RL), this paper proposes CaRe-BN. The method utilizes "Confidence-aware Adaptation" (Kalman-style weighting) for real-time, low-variance estimation of BN statistics, and "Periodic Recalibration" (resampling large batches from the replay buffer) for bias correction. This improves SNN agent performance on Atari/MuJoCo by up to 22.6%, even surpassing corresponding ANNs by 5.9%, with zero additional inference overhead.
- Causally Robust Reward Learning from Reason-Augmented Preference Feedback
-
ReCouPLe treats a short natural language reason (e.g., "because it avoids a collision") as a projection axis in the embedding space. It decomposes trajectory representations into "reason-aligned" and "reason-orthogonal" components, ensuring preferences are explained only by the aligned component. This strips away spurious features and significantly outperforms binary preference baselines in distribution shifts and zero-shot task transfer.
- CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models
-
To address the issues of insufficient exploration, premature convergence, and entropy collapse in RLVR (Reinforcement Learning from Verifiable Rewards) training for LLMs, CDE guides exploration using the model's own "curiosity." It utilizes the perplexity (PPL) of generated responses at the actor side and the variance of value estimates from a multi-head critic at the critic side as exploration rewards. Without training additional representation modules, CDE achieves stable improvements of approximately \(+3\) points over standard GRPO/PPO on mathematical reasoning benchmarks such as AIME, while simultaneously fixing a training failure mode termed "calibration collapse."
- Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs
-
This paper proposes Chain-of-Context Learning (CCL), which achieves step-by-step dynamic constraint-aware decoding through two components: Relevance-Guided Context Reformulation (RGCR), which adaptively aggregates constraint information to build context, and Trajectory-Shared Node Re-embedding (TSNR), which updates nodes shared across trajectories to avoid redundant computation. It comprehensively outperforms existing methods on 48 VRP variants (16 in-distribution + 32 out-of-distribution).
- Chessformer: A Unified Architecture for Chess Modeling
-
By treating the 64 board squares as tokens and adding a "Geometric Attention Bias" (GAB) dynamically generated per position to the self-attention mechanism, Chessformer utilizes a unified architecture to simultaneously push "engine strength," "human move prediction," and "interpretability"—three long-separated objectives—to SOTA. The 79M-parameter MAIA-3 improves human move matching to 57.1% with less than a quarter of the size of its competitors, while the version integrated into Leela Chess Zero gained 100+ Elo and defeated Stockfish in multiple top-tier engine tournaments.
- Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns
-
The MLP critic in SAC is replaced with a lightweight causal Transformer, allowing the critic to evaluate all prefixes of a "state + short action sequence" simultaneously. By using multi-horizon N-step returns for supervision without requiring importance sampling, the method maintains a strictly single-step policy while significantly outperforming standard SAC and episodic baselines on long-range, sparse-reward tasks.
- Composition of Memory Experts for Diffusion World Models
-
Addressing the structural contradiction where "longer context improves world model accuracy but explodes computational cost," this paper shifts the memory burden from a single backbone to three independent diffusion experts (short-term, long-term, and spatial long-term). These are fused during sampling via a "Product of Contrastive Experts" (PoCE), maintaining temporal and spatial consistency over 500+ frames at a cost significantly lower than scaling attention.
- ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
-
ComputerRL proposes an end-to-end online RL framework for desktop computer use agents. It unifies programmatic API calls and human-like GUI operations into a single action space via the API-GUI paradigm, establishes a distributed asynchronous RL infrastructure capable of running thousands of concurrent virtual desktops, and utilizes Entropulse (alternating RL and SFT) to combat entropy collapse during long training. Consequently, the 9B GLM-ComputerRL achieves a 48.9% success rate on OSWorld, surpassing larger closed/open-source agents such as OpenAI CUA o3, UI-TARS-1.5, and Claude 4.
- Context and Diversity Matter: The Emergence of In-Context Learning in World Models
-
This paper reformulates the "adaptability of world models" as an In-Context Learning (ICL) problem, decomposing it into two mechanisms: "Environment Recognition (ER)" and "Environment Learning (EL)". By deriving error upper bounds for both, the authors demonstrate that only sufficiently long contexts + sufficiently diverse environments can catalyze genuine EL. They empirically validate this theory using L2World, a linear attention long-context world model, on cart-pole and indoor navigation tasks.
- Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
-
The authors propose the VIP (Value Iteration via PINN) framework, which marks the first use of Physics-Informed Neural Networks (PINNs) to solve the HJB partial differential equations in continuous-time multi-agent reinforcement learning. By introducing a Value Gradient Iteration (VGI) module to iteratively refine value gradients, the method consistently outperforms both discrete-time and continuous-time baselines on continuous-time MPE and MuJoCo multi-agent tasks.
- Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
-
CalibRL redefines expert data as a distribution calibration baseline (rather than a strict imitation target), achieving fine-grained control over the exploration-exploitation balance in MLLM reasoning training through LeakyReLU asymmetric activation and advantage weighting. This effectively addresses the entropy collapse problem in RLVR and significantly outperforms GRPO/DAPO on geometric reasoning tasks.
- Convergence of an actor-critic gradient flow for entropy regularised MDPs in general spaces
-
This paper establishes the stability and global convergence of a coupled actor-critic gradient flow—where the critic utilizes TD learning and the actor employs Policy Mirror Descent (PMD)—for infinite-horizon MDPs with continuous state/action spaces and entropy regularization. The core conclusion is that the system avoids finite-time blow-up and converges to the optimal regularized value function at an exponential rate, provided the critic updates at an exponentially faster timescale.
- Correlated Policy Optimization in Multi-Agent Subteams
-
The joint policy in cooperative multi-agent systems is decomposed using a Directed Acyclic Graph (DAG/Bayesian Network) where agents are fully correlated within "subteams" and independent across teams. Under the condition of decomposable rewards/transitions, it is proven that regularized policy gradients converge to near-optimal policies. A heuristic for dynamic subteam assembly based on "dependency scores + edge budget" is provided and integrated into MAPPO/MADDPG, outperforming standard baselines across multiple benchmarks.
- Critique-RL: Training Language Models for Critiquing Through Two-Stage Reinforcement Learning
-
Critique-RL employs an online RL scheme to train "critique models" without relying on annotations from stronger supervisors. It first stabilizes discriminability using direct rule-based rewards, then enhances helpfulness via indirect rewards based on refinement accuracy while maintaining discriminability through regularization, enabling weak models to produce accurate and helpful feedback.
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
-
CUDA-L1 is a three-stage pipeline framework based on Contrastive Reinforcement Learning (Contrastive RL). It transforms an LLM with initially weak CUDA capabilities into an efficient CUDA optimizer, achieving an average speedup of 3.12× (with a peak of 120×) across 250 CUDA kernels in KernelBench, while demonstrating transferability across GPU architectures.
- DEAS: DEtached value learning with Action Sequence for Scalable Offline RL
-
DEAS treats "continuous H-step actions" as the input unit for value functions in offline RL, compressing the effective planning horizon similar to n-step TD. To avoid value overestimation caused by action space expansion, it stabilizes training with IQL-style "detached value learning" (critic training is completely independent of the actor), categorical distributional value estimation, and dual discount factors. It significantly outperforms baselines like FQL/Q-Chunking on OGBench long-horizon tasks and can be directly integrated into large-scale VLAs such as GR00T and π0 to improve real-robot manipulation success rates.
- Decoupled Q-Chunking
-
Addressing the contradiction where "chunked critics accelerate value propagation but require the policy to output a whole open-loop action chunk—which is hard to learn and inflexible," this paper proposes Decoupled Q-Chunking (DQC). By decoupling the critic's action chunk length \(h\) from the policy's action chunk length \(h_a\) (\(h_a \ll h\)), the policy only predicts a short section of actions. This policy is guided by a "partial critic" optimistically distilled from a larger critic, thereby retaining the multi-step value propagation advantages of chunked critics while bypassing the difficulty of learning long-chunk policies. This approach consistently outperforms previous SOTA on the most challenging long-horizon goal-conditioned tasks in OGBench.
- Deep SPI: Safe Policy Improvement via World Models
-
This work constructs a theoretical framework for Safe Policy Improvement (SPI), unifying world models and representation learning with policy update guarantees. By constraining policy updates through a neighborhood operator based on importance ratios, it ensures monotonic improvement and convergence. Combined with local transition/reward losses to control world model quality and representation stability, the proposed DeepSPI algorithm matches or exceeds PPO and DeepMDP on the ALE-57 benchmark.
- DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Tree-based Search
-
DeepSearch shifts MCTS from the inference phase forward into the RLVR training loop, utilizing global frontier selection, confident error trajectory supervision, and replay buffer caching to enhance exploration efficiency in mathematical reasoning. It surpasses extended training baselines on a 1.5B model with a 62.95% average accuracy while significantly reducing GPU overhead.
- Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts
-
DEFT introduces the Mixture-of-Experts (MoE) architecture to dynamic cloud workflow scheduling for the first time. It replaces the single-path feed-forward policy head in traditional DRL schedulers with a set of experts specialized in different "deadline tightness levels," coupled with a graph-adaptive gating network that understands DAG structures and urgency for step-by-step routing. In large-scale scenarios, it reduces total scheduling costs by nearly 30% compared to the SOTA.
- Demystifying The Mechanisms Behind Emergent Exploration in Goal-Conditioned RL
-
This paper uses a cognitive science-inspired "Rational Analysis + Intervention + Minimal Modeling" triplet to deconstruct why reward-free Single-Goal Contrastive RL (SGCRL) exhibits spontaneous exploration. The conclusion is that the actor maximizes an implicit reward shaped by the critic's representation (state-goal \(\psi\)-similarity), and this exploration-exploitation dynamic emerges from low-rank representations learned via contrastive learning, rather than neural network function approximation.
- Dichotomous Diffusion Policy Optimization
-
DIPOLE decomposes the exponential weight of the optimal policy in KL-regularized RL into a pair of bounded "dichotomous policies" (one pursuing high returns, the other low returns), stabilizes training using sigmoid weighting, and linearly combines their scores at inference—similar to classifier-free guidance—to achieve stable diffusion policy optimization with controllable greediness.
- Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach
-
DIPPER formulates goal-conditioned hierarchical reinforcement learning (HRL) as a bilevel optimization problem. It trains the high-level subgoal policy using DPO with primitive regularization based on the low-level value function. This simultaneously mitigates non-stationarity caused by low-level policy evolution and the generation of unreachable subgoals by the high-level policy. It significantly outperforms various HRL, DPO, and flat RL baselines on sparse-reward robot navigation and manipulation tasks.
- Distributional value gradients for stochastic environments
-
Addressing the failure of "value gradient for credit assignment" methods like MAGE in stochastic/noisy environments, this paper extends distributional RL from "modeling return distributions" to "jointly modeling the return and its action-gradient distribution." It proposes a Sobolev distributional Bellman operator, a differentiable world model, and the max-sliced MMD metric, providing the first contraction proof for gradient-aware RL and demonstrating superior robustness over deterministic gradient methods on noisy MuJoCo tasks.
- Distributionally Robust Cooperative Multi-agent Reinforcement Learning with Value Factorization
-
This paper introduces distributionally robust reinforcement learning into cooperative multi-agent value factorization and proposes the DrIGM principle. This ensures that the robust greedy actions of individual agents can still be combined into a globally robust optimal joint action. Based on this, the authors implement robust versions of VDN, QMIX, and QTRAN that are more stable under environmental distribution shifts.
- DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition
-
DiVE-k leverages the top-k generation results of Large Vision-Language Models (LVLMs) to construct multiple-choice questions (MCQs). Through GRPO reinforcement learning, the model is trained to perform differential visual reasoning, significantly outperforming existing methods in base-to-novel generalization for fine-grained image recognition.
- Diversity-Incentivized Exploration for Versatile Reasoning
-
DIVER identifies a strong positive correlation between "global sequence-level diversity of a set of responses" and the reasoning capability of LLMs. By formulating this diversity as an intrinsic reward, applying potential function shaping to preserve optimal policy invariance, and using conditional shaping to prevent reward hacking, RLVR significantly improves Pass@k and cross-domain generalization in mathematical reasoning without compromising Pass@1.
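For reference, the classical form of potential-based shaping (Ng et al., 1999) that the summary above alludes to can be written as follows. The potential \(\Phi\) and discount \(\gamma\) are generic symbols here; DIVER's specific diversity-based potential is not spelled out in this summary, so this is only the textbook template.

\[
r'(s, a, s') = r(s, a, s') + \gamma\,\Phi(s') - \Phi(s)
\]

Summed along a trajectory with discounting, the shaping terms telescope to \(\gamma^{T}\Phi(s_T) - \Phi(s_0)\). When the terminal term vanishes, every policy's return from \(s_0\) shifts by the same constant \(-\Phi(s_0)\), which is why the set of optimal policies is preserved.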
- Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models
-
This paper proposes the Pram framework, the first to utilize Multimodal Language Models (MLM) to solve Multi-Commodity Flow (MCF) problems. By partitioning the original problem into sub-problems and using Multi-Agent Reinforcement Learning (MARL) to coordinate global consistency, the method is theoretically proven to converge to the optimal solution. Empirical results show it is 1-2 orders of magnitude faster than LP solvers while achieving near-optimal performance.
- Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
-
This paper points out that during RL training for LLMs (such as GRPO), low-probability tokens dominate parameter updates due to excessively large gradient magnitudes, suppressing equally important high-probability tokens. The authors propose two simple methods—Advantage Reweighting (linearly scaling down low-probability token weights based on probability) and Lopti (updating low-probability tokens before high-probability tokens)—improving GRPO by up to 46.2% on K&K logic puzzles.
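The Advantage Reweighting idea described above can be sketched in a few lines. This is a minimal illustration, assuming a per-token weight of the form `alpha * p + (1 - alpha)`; the function name, the `alpha` default, and the list-based interface are hypothetical and not taken from the paper's code.

```python
def reweight_advantages(advantages, token_probs, alpha=0.3):
    """Scale each token's advantage by a weight that grows linearly with its probability.

    Assumed form (illustrative): weight = alpha * p + (1 - alpha), so a token with
    p = 1 keeps its full advantage and a token with p -> 0 is scaled by (1 - alpha).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(advantages) != len(token_probs):
        raise ValueError("advantages and token_probs must have the same length")
    return [(alpha * p + (1.0 - alpha)) * a
            for a, p in zip(advantages, token_probs)]


# Three tokens: certain, near-impossible, and a coin flip.
scaled = reweight_advantages([1.0, 1.0, -2.0], [1.0, 0.0, 0.5], alpha=0.4)
```

With `alpha = 0` the weights are all 1 and the original advantages are recovered, so the hyperparameter interpolates between the unmodified update and the strongest down-weighting of low-probability tokens.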
- Does “Do Differentiable Simulators Give Better Policy Gradients?” Give Better Policy Gradients?
-
This work is a "revisitation" of the identically titled paper by Suh et al. (2022). The authors replace the original REINFORCE-based discontinuity detection with a lightweight statistical test (DDCG) that depends only on function values and gradient variances, robustly reproducing and improving the original method with a single hyperparameter. More importantly, they propose Step-wise Inverse Variance Weighting (IVW-H), which outperforms GIPPO on MuJoCo control tasks without any discontinuity detection. This demonstrates that while "estimator switching" is useful in controlled studies, the real bottleneck in practical robot control is often variance rather than "empirical bias."
- Don't Just Fine-tune the Agent, Tune the Environment
-
This paper proposes the Environment Tuning training paradigm, which uses structured curricula, actionable environment-enhanced feedback, and fine-grained progress rewards to enable LLM agents to learn complex multi-turn tool use from scratch with only 400 training samples, while achieving superior out-of-distribution generalization.
- DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
-
DOPPLER models the problem of assigning dataflow graph operators across multiple GPUs to minimize execution time as a sequential decision-making problem. By utilizing a pair of policies (SEL to select the next operator and PLC to assign it a device) combined with a three-stage training pipeline (Imitation Learning → Simulator RL → Online Real-Device RL), it reduces execution time by up to 52.7% compared to the strongest baselines in asynchronous, work-conserving (WC) execution environments.
- DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
-
DR-SAC is the first actor-critic distributionally robust reinforcement learning (DR-RL) algorithm designed for continuous action spaces in offline settings. It performs "worst-case maximum entropy optimization" over a transition distribution uncertainty set characterized by a KL divergence ball. The authors provide a distributionally robust soft policy iteration with convergence guarantees and operationalize the algorithm for continuous control using functional rewriting and VAE generative models. Under perturbations, the average return is up to 9.8× higher than SAC, with training times reduced by over 80% compared to the existing DR-RL method, RFQI.
- Dual-Objective Reinforcement Learning with Novel Hamilton-Jacobi-Bellman Formulations
-
This paper extends HJ-RL (Hamilton-Jacobi Reinforcement Learning) from single "Reach / Avoid / Reach-Avoid" objectives to two types of dual-objective problems—Reach-Always-Avoid (RAA) and Reach-Reach (RR). It demonstrates that their Bellman equations can be exactly decomposed into combinations of previously studied simple reach/avoid sub-problems. Based on this, the DOHJ-PPO algorithm is designed, surpassing 10 Lagrangian-based and HJ-RL baselines in success rate, safety, and speed.
- Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts
-
This is the first work to simultaneously address train-time robustness (source-target dynamics mismatch) and test-time robustness (deployment environment dynamics shift) in cross-domain offline RL. The proposed DROCO algorithm centers on the Robust Cross-Domain Bellman (RCB) operator, which applies robust Bellman updates to source data and standard in-sample updates to target data. Through dual reconstruction, intractable dynamics uncertainty is mapped to state-space perturbations. On D4RL benchmarks, it achieves a total score of 1105.2, surpassing the runner-up by 14%, with performance degradation under hard-level dynamics perturbations only half that of baselines.
- Dual Goal Representations
-
The paper proposes "dual goal representations," which encode goals using the set of temporal distances from all states to the target state. The authors provide theoretical proof that this representation is sufficient for optimal policy recovery and naturally filters exogenous noise. A practical learning algorithm based on asymmetric inner product parametrization is designed, consistently improving the performance of three mainstream offline GCRL methods as a plug-and-play module across 20 OGBench tasks.
- DuPO: Enabling Reliable Self-Verification via Dual Preference Optimization
-
DuPO relaxes traditional dual learning from "strictly reversible task pairs" to "complementary dependency relationships"—allowing the dual task to reconstruct only an unknown component of the input from the primary task output. By using reconstruction consistency as a self-supervised reward, it achieves RL optimization without any labels for irreversible tasks like mathematical reasoning and multilingual translation.
- DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
-
This paper proposes the DVLA-RL framework, which generates complementary low-level attributes and high-level descriptions through Dual-level Semantic Construction (DSC). It utilizes RL-based Gated Attention (RLA) to dynamically balance the contributions of self-attention and cross-attention across different network layers, achieving hierarchical vision-language alignment from low-level to high-level features and reaching SOTA on 9 few-shot learning benchmarks.
- Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
-
This paper proposes a new paradigm called audio-interleaved reasoning, where audio is treated as an active component in the reasoning process rather than a static context. This allows Large Audio-Language Models (LALMs) to dynamically locate and re-listen to audio segments during inference. Through a two-stage SFT+RL training framework and a structured data generation pipeline, the Echo model is constructed, surpassing GPT-4o and Gemini-2.0-Flash on both expert-level and general audio understanding benchmarks.
- Efficient Estimation of Kernel Surrogate Models for Task Attribution
-
This work proposes Kernel Surrogate Models (KernelSM) for task attribution, utilizing RBF kernel ridge regression to capture non-linear interaction effects between tasks. Combined with an efficient estimation algorithm via gradient projection to avoid redundant training, it achieves a 25% correlation improvement over linear surrogate and influence function baselines in scenarios such as mathematical reasoning, in-context learning, and multi-objective RL.
- Efficient Morphology-Control Co-Design via Stackelberg Proximal Policy Optimization
-
The study reformulates the joint optimization of "robot morphology design + control policy" as a Phase-Separated Stackelberg Game (where morphology acts as the leader and control as the follower). It derives Stackelberg policy gradients capable of propagating through "non-differentiable morphology editing interfaces," encapsulated into Stackelberg PPO. This allows morphology updates to actively anticipate how control policies will adapt, resulting in stable training and an average performance improvement of 20.66% over the strongest baseline.
- Efficient Offline Reinforcement Learning via Peer-Influenced Constraint
-
This paper proposes Peer-Influenced Constraint (PIC): instead of treating only the action associated with the current state in the dataset as a conservative constraint, it borrows candidate actions from similar "peer states" and uses a critic to select superior in-distribution actions to guide the actor. Furthermore, it combines this with a small-scale ensemble critic to form EPIC, achieving higher average scores on D4RL MuJoCo, AntMaze, and Adroit while maintaining low training overhead.
- Efficient Reinforcement Learning by Guiding World Models with Non-Curated Data
-
NCRL first pre-trains a task-agnostic world model using reward-free, mixed-quality, multi-embodiment non-curated data, then guides exploration during online RL via retrieval-based experience replay and behavior cloning prior policies. This significantly mitigates the distribution mismatch between offline pre-training and online fine-tuning, achieving performance comparable to training from scratch with several times the sample budget across 72 visual-motor control tasks using only 150k interaction steps.
- EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
-
This paper proposes Egg-SR, a unified framework that embeds symbolic equivalence into MCTS, DRL, and LLM methodologies via equality graphs (e-graphs). This yields subtree pruning, gradient variance reduction, and feedback prompt enhancement, respectively. Theoretical proofs demonstrate that Egg-MCTS tightens regret bounds and Egg-DRL reduces gradient estimation variance. Experiments confirm consistent improvements in expression discovery accuracy.
- Ego-Foresight: Self-supervised Learning of Agent-Aware Representations for Improved RL
-
Inspired by human "motion prediction," Ego-Foresight utilizes the cue that "the agent's body configuration is predictable by future actions when it moves" to disentangle agent features from scene features without any supervised masks. Integrated as an auxiliary task into DrQ-v2 and TD-MPC2, it significantly enhances the sample efficiency and performance of visual RL.
- ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems
-
ELMUR equips every layer of a Transformer with a structured external memory. Through bi-directional cross-attention for reading/writing and LRU rules (replacement/convex combination) for maintenance, it achieves a bounded yet persistent memory. It extends the effective memory horizon to 100,000 times the attention window, achieving a 100% success rate on the million-step T-Maze and nearly doubling the success rate of strong baselines in sparse-reward visual robotic manipulation.
- EMFuse: Energy-based Model Fusion for Decision Making
-
EMFuse unifies "direct policy fusion" and "dynamics model fusion"—two seemingly distinct tasks—under the framework of Energy-based Models (EBM). Summing energies is equivalent to multiplying distributions (Product of Experts, PoE). This enables training-free fusion of multiple LLM experts during inference and utilizes a new ADETM architecture to bypass the exponential explosion in fusing dynamics ensembles. It achieves improvements of 0.34%–6.63% on discrete decision benchmarks and adds 2.3–7.4 normalized points on D4RL continuous control tasks.
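The identity the summary relies on (summing energies equals multiplying distributions) can be checked numerically on a toy discrete action set. This is a minimal sketch, not EMFuse itself: the energy values and helper names are made up for illustration, and each expert is assumed to define a Boltzmann distribution \(p(a) \propto \exp(-E(a))\).

```python
import math


def boltzmann(energies):
    """Distribution p(a) proportional to exp(-E(a)) over a finite action set."""
    shift = min(energies)  # subtract the minimum energy for numerical stability
    weights = [math.exp(-(e - shift)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def product_of_experts(dists):
    """Renormalized elementwise product of several distributions."""
    prod = [math.prod(ps) for ps in zip(*dists)]
    total = sum(prod)
    return [p / total for p in prod]


# Hypothetical energies assigned by two experts to three actions.
energies_a = [0.2, 1.5, 0.7]
energies_b = [1.0, 0.3, 0.9]

# Route 1: add the energies, then normalize once.
fused_by_energy = boltzmann([ea + eb for ea, eb in zip(energies_a, energies_b)])
# Route 2: normalize each expert, multiply, then renormalize.
fused_by_product = product_of_experts([boltzmann(energies_a), boltzmann(energies_b)])
```

Both routes give the same fused distribution up to floating-point error, because each expert's normalizing constant cancels in the final renormalization.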
- Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
-
This paper proposes AIGB-Pearl, equipping "Generative Auto-bidding" (AIGB) with a trajectory evaluator as an offline reward signal. It utilizes a theoretically guaranteed KL-Lipschitz constrained score-maximization to enable the generative planner to safely explore high-quality trajectories beyond the offline dataset, thereby breaking the performance ceiling of pure imitation learning.
- Enough is as good as a feast: A Comprehensive Analysis of How Reinforcement Learning Mitigates Task Conflicts in LLMs
-
This paper systematically compares the impact of two post-training paradigms, SFT and RL, on "model merging." It discovers that RL-trained models experience significantly less performance degradation after merging compared to SFT-trained ones. Practical and theoretical explanations are provided from three perspectives: on-policy data, adaptive decay of RL optimization objectives, and the joint optimization of positive and negative samples.
- Entropy-Preserving Reinforcement Learning (REPO / ADAPO)
-
This paper reveals the theoretical root cause of systemic policy entropy collapse in policy gradient RL algorithms during LLM post-training (the positive correlation between advantage functions and log-probabilities). It proposes two complementary solutions: REPO (decorrelation by modifying the advantage function) and ADAPO (adaptive asymmetric clipping), achieving SOTA performance on interactive tool-use tasks.
- Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints
-
ERA (Entropy Regularizing Activation) imposes an entropy lower bound constraint by appending a specialized activation function to the network output layer. This approach requires no modification to the loss function and improves performance across continuous control RL, LLM reasoning, and image classification within a single framework.
- Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
-
This paper proposes Erasable Reinforcement Learning (ERL). In the multi-hop reasoning trajectories of search-augmented LLMs, faulty sub-queries or sub-answers are identified through dense process rewards, then erased in-place and regenerated. This transforms fragile "one-error-ruins-all" trajectories into recoverable robust processes. The resulting ESearch model achieves new SOTA performance across four multi-hop QA benchmarks.
- Escaping Policy Contraction: Contraction-Aware PPO (CaPPO) for Stable Language Model Fine-Tuning
-
This paper points out that PPO in RLHF causes the policy "support set" to gradually contract (entropy collapse, increased repetition, and the zeroing out of probabilities for reasonable SFT answers). It proposes the Support Retention Ratio (SRR) to quantify this phenomenon and designs CaPPO, which treats reward, entropy, and KL as equal objectives for minimum-norm multi-gradient updates and adds an entropy scheduling controller. CaPPO significantly recovers diversity and SRR without lowering alignment win rates, which instead rise by 2 to 4 points.
- EUBRL: Epistemic Uncertainty Directed Bayesian Reinforcement Learning
-
EUBRL directly integrates "epistemic uncertainty" into the RL objective function via probabilistic inference, utilizing a binary "uncertainty variable" to adaptively switch between exploration and exploitation. Theoretically, it is the first to achieve near-minimax optimal regret bounds and sample complexity simultaneously in infinite-horizon undiscounted MDPs.
- ExGRPO: Learning to Reason from Experience
-
This paper presents the first systematic study on what types of reasoning experiences are most valuable for RLVR. It identifies that medium-difficulty problems combined with low-entropy trajectories are most effective. Based on this, the ExGRPO framework for experience management and hybrid policy optimization is proposed, achieving an average gain of +3.5 points in mathematical reasoning and +7.6 points in general reasoning.
- Exo-Plore: Exploring Exoskeleton Control Space through Human-Aligned Simulation
-
This paper proposes the Exo-plore framework, which combines neuromechanical simulation with deep reinforcement learning to optimize hip exoskeleton control parameters without human trials, enabling generalization to pathological gait scenarios.
- Expertise Can Be Helpful for Reinforcement Learning-based Macro Placement
-
EXPlace explicitly encodes four types of expert knowledge accumulated by chip layout engineers (dataflow, macro grouping, periphery bias, and I/O keepout) into dense rewards and state masks for RL. It then employs Direct Preference Optimization (DPO) to mimic the expert workflow of "iterative refinement based on backend PPA feedback" for timing fine-tuning. This allows RL-based placement to significantly outperform analytical, black-box, and RL peers on real sign-off metrics such as TNS/WNS for the first time.
- Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
-
Through theoretical derivation and cross-model experiments, this paper demonstrates that the learning signal provided by clipping bias in RLVR is negligible (\(\leq 1/17\)). The effective mechanism is instead the implicit compression effect of clipping on policy entropy. The paper also proposes a reward mislabeling model to explain how strong models benefit from random rewards.
- Exploratory Diffusion Model for Unsupervised Reinforcement Learning
-
ExDM introduces diffusion models to unsupervised reinforcement learning for the first time. It utilizes diffusion models to fit heterogeneous state distributions within the replay buffer, using "poorly fitted regions" as score-based intrinsic rewards to drive exploration. Additionally, it designs an efficient online fine-tuning algorithm for diffusion policies with convergence guarantees.
- EXPO: Stable Reinforcement Learning with Expressive Policies
-
EXPO bypasses the instability of backpropagating value gradients through diffusion/flow-matching chains by combining "imitation learning for the base expressive policy + lightweight Gaussian editing for Q-value maximization + on-the-fly selection of the highest-value action," achieving a 2–3× improvement in online RL fine-tuning sample efficiency.
- FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning
-
Addressing the "flawed-positive rollout" problem (correct answer but flawed reasoning) in RLVR training, this paper proposes the FAPO algorithm. It utilizes a GenRM to detect flawed reasoning and implements a "first exploit, then suppress" natural learning trajectory through a parameter-free reward penalty mechanism, simultaneously improving result accuracy, process reliability, and training stability.
- Fine-tuning Behavioral Cloning Policies with Preference-Based Reinforcement Learning
-
This paper provides the first rigorous analysis for the "offline imitation + online preference fine-tuning" paradigm, which is widely used in RLHF and robotics but lacks theoretical support. It proposes the BRIDGE algorithm, which first constructs a Hellinger confidence ball in the trajectory distribution space with a radius shrinking at \(O(1/\sqrt{n})\) using expert demonstrations, then constrains online preference exploration within this ball. It proves that the online regret bound tends to zero as the offline data volume \(n\) increases and validates that the regret is lower than pure imitation or pure online preference RL on discrete and continuous MuJoCo control tasks.
- Finite-Time Analysis of Actor-Critic Methods with Deep Neural Network Approximation
-
This paper provides the first finite-time convergence analysis of the single-timescale neural Actor-Critic algorithm in continuous state-action spaces under the time-average reward setting. It proves that the reward, critic, and actor errors converge to a stationary point at a rate of \(\tilde{O}(T^{-1/2})\), and the convergence rate does not diverge with the network width \(m\).
- floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL
-
floq reframes the Q-function from a "single network mapping to a scalar" to a "velocity field flowing toward a Q-value via multi-step numerical integration." By using flow-matching to introduce dense supervision into value learning, critic capacity can be scaled by increasing integration steps (rather than just depth or width), increasing success rates by approximately 1.8× on difficult offline RL tasks.
- Flow Actor-Critic for Offline Reinforcement Learning (FAC)
-
FAC is the first method to use continuous normalizing flows jointly for an expressive actor policy and a density-estimation-based critic penalty mechanism. By identifying OOD regions for selective conservative Q-value estimation, it achieves an average score of 60.3 across 55 OGBench tasks, significantly outperforming the previous best (43.6).
- Flow Matching Policy Gradients
-
This paper introduces Flow Policy Optimization (FPO), which integrates the conditional flow matching loss directly into the PPO-clip framework. By using the "exponential of the difference between the old and new policy CFM losses" as a proxy for the likelihood ratio, FPO enables training diffusion/flow policies from scratch using pure on-policy gradients. This approach avoids calculating exact likelihoods of flow models and is decoupled from specific samplers, achieving performance that meets or exceeds Gaussian policies in continuous control and under-conditioned humanoid control tasks.
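The ratio proxy described above slots into the usual PPO-clip surrogate as follows. This is a minimal per-sample sketch under stated assumptions: the two CFM losses are taken as already-estimated scalars for the same state-action pair, and the function name and `clip_eps` default are illustrative rather than taken from the FPO implementation.

```python
import math


def fpo_surrogate(cfm_loss_old, cfm_loss_new, advantage, clip_eps=0.2):
    """PPO-clip surrogate for one sample, with a CFM-loss proxy as the ratio."""
    # Proxy ratio: exp(old loss - new loss). It exceeds 1 when the updated
    # policy fits the sampled action better (lower CFM loss) than the old one.
    ratio = math.exp(cfm_loss_old - cfm_loss_new)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Standard pessimistic PPO objective: take the smaller of the two terms.
    return min(ratio * advantage, clipped * advantage)


unchanged = fpo_surrogate(0.8, 0.8, advantage=1.5)    # ratio is exactly 1
improved = fpo_surrogate(2.0, 1.0, advantage=1.0)     # ratio above the clip range
penalized = fpo_surrogate(2.0, 1.0, advantage=-1.0)   # negative advantage, unclipped term wins
```

Because only loss differences enter the ratio, no exact likelihood of the flow model is needed, which matches the motivation given in the summary.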
- Flowing Through States: Neural ODE Regularization for Reinforcement Learning
-
This paper proposes FlowReg: using a neural ODE to fit a smooth and continuous trajectory flow in the latent space, and forcing the agent's state encoder to align the latent representations of adjacent states along this ODE flow via an alignment loss. This explicitly injects "environment transition dynamics" into representation learning, achieving significant performance improvements on Atari (A2C) and MiniGrid (PPO).
- From \(f(x)\) and \(g(x)\) to \(f(g(x))\): LLMs Learn New Skills in RL by Composing Old Ones
-
The paper uses a decontaminated synthetic string transformation task to demonstrate that when LLMs have mastered "atomic skills" through pre-training, as long as RL training explicitly incentivizes "composition", they can truly learn entirely new compositional skills that cannot be explained by atomic skills alone. These models generalize to deeper nesting levels and even completely different tasks—directly contradicting the pessimistic view that "RL only rearranges the existing capabilities of the base model."
- From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning
-
This paper finds that the reasoning performance of Multimodal LLMs is highly correlated with Visual Attention Scores (VAS) (\(r=0.96\)). It proposes the AVAR framework to enhance VAS through three stages: visual anchoring data synthesis, attention-guided training objectives, and visual anchoring reward shaping, achieving an average improvement of 7% across 77 benchmarks.
- From Observations to Events: Event-Aware World Models for Reinforcement Learning
-
Inspired by the cognitive science concept that "humans segment continuous sensory streams into discrete events," this paper proposes EAWM, a general framework that enables world models to additionally predict "events" (significant changes in brightness, values, or categories) alongside future observations. This allows learning compact kinematic representations, improving strong baselines such as DreamerV3 and Simulus by 10%–45% on benchmarks like Atari, Craftax, and DMC.
- From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
-
This paper models deep RL for continuous control as a continuous-time stochastic process. By introducing two timescales—the "environment clock" and the "gradient clock"—and employing Itô-Taylor expansion with linearized infinite-width networks, it derives the first equations for the infinitesimal evolution of the state distribution at each gradient step, ultimately simplifying the process into a closed system with only five time-varying variables.
- From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for RL of Open-ended Generation
-
This paper proposes the RLVRR framework, extending RLVR (Reinforcement Learning with Verifiable Rewards) from mathematical and code reasoning to open-ended text generation. By extracting keyword sequences (content reward) and executable Python check functions (style reward) from high-quality reference answers, it constructs a "Reward Chain" to replace single-point verification signals. With only 10K data, it outperforms 100K SFT and advanced reward models across more than 10 benchmarks.
- Frozen Policy Iteration: Computationally Efficient RL under Linear \(Q^{\pi}\) Realizability for Deterministic Dynamics
-
Under the mild assumption that "the Q-function of any policy is linearly representable" (linear \(Q^\pi\) realizability), this paper proposes Frozen Policy Iteration (FPI)—the first computationally and statistically efficient online RL algorithm without a simulator for deterministic MDPs. It achieves a regret of \(\tilde O(\sqrt{d^2 H^6 T})\), answering an open problem posed by Weisz et al. (2023).
- GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
-
GAR integrates a "statement fuser" and a "prover" into a joint adversarial RL closed-loop. The fuser is rewarded for synthesizing "harder but solvable" theorems, while the prover is rewarded for solving them. This automatically forms an implicit curriculum where the problem difficulty continuously scales with the prover's current capabilities.
- GAS: Enhancing Reward-Cost Balance of Generative Model-assisted Offline Safe RL
-
GAS enhances generative model-driven Offline Safe RL with trajectory stitching capabilities through objective functions, transition-level data augmentation/relabeling, and data reshaping. It automatically calibrates user-specified (potentially unreliable) reward-cost targets into the optimal reachable targets within the dataset that satisfy constraints, achieving higher safety under tight constraints and higher rewards under loose constraints.
- GEM: A Gym for Agentic LLMs
-
GEM is an open-source "environment simulator" for the LLM agent era, comparable to OpenAI Gym. It provides a unified environment-agent interface, asynchronous vectorized execution, rich tools, and 24 standardized multi-turn environments. It also introduces a REINFORCE + Return Batch Normalization (ReBN) baseline algorithm compatible with dense step-wise rewards and arbitrary discount factors.
- General search techniques without common knowledge for imperfect-information games, and application to superhuman Fog of War chess
-
This paper proposes Obscuro, which extends real-time imperfect-information search to Fog of War chess by employing knowledge-limited subgame solving that avoids enumerating common knowledge sets, single-sided GT-CFR expansion, and policy purification, achieving superhuman performance in this game for the first time.
- Generalization of RLVR Using Causal Reasoning as a Testbed
-
This paper uses "probabilistic inference on causal graphs" as a strictly verifiable microscope to decompose the generalization advantages of RLVR (Reinforcement Learning with Verifiable Rewards) over SFT. The findings suggest that RLVR's benefits emerge only when the model possesses sufficient initial reasoning capability, primarily manifesting through improved marginalization strategies and reduced intermediate derivation errors.
- Geometric-Mean Policy Optimization
-
This work replaces the "arithmetic mean" used in GRPO for optimizing token-level rewards with a "geometric mean." By leveraging the inherent robustness of the geometric mean to outliers, the method suppresses extreme importance sampling ratios, thereby stabilizing policy updates without sacrificing exploration capability. On mathematical reasoning tasks, it achieves a Pass@1 improvement of up to 4.1% over GRPO.
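The outlier robustness mentioned above can be illustrated with a minimal sketch. It only compares the arithmetic and geometric means of per-token importance ratios. The ratio values are made up for illustration, and the clipping and advantage weighting of the actual objective are not modeled.

```python
import numpy as np

def arithmetic_mean(ratios):
    """Plain average of the per-token importance ratios."""
    return float(np.mean(ratios))

def geometric_mean(ratios):
    """Geometric mean, computed in log space for numerical stability."""
    return float(np.exp(np.mean(np.log(ratios))))

# Hypothetical per-token ratios: three on-policy tokens and one extreme outlier.
ratios = np.array([1.0, 1.0, 1.0, 10.0])

am = arithmetic_mean(ratios)  # dominated by the outlier
gm = geometric_mean(ratios)   # the outlier only enters through its logarithm
print(f"arithmetic mean: {am:.3f}, geometric mean: {gm:.3f}")
```

Because the outlier contributes through its logarithm, a single extreme ratio moves the geometric mean far less than the arithmetic mean.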
- Geometry of Uncertainty: Learning Metric Spaces for Multimodal State Estimation in RL
-
Uncertainty estimation is reformulated as a geometric problem in metric spaces—constructing a latent space where Euclidean distance represents the "minimum number of actions between two states," then fusing multimodal sensors via inverse distance weighting. This achieves robust state estimation against unseen sensor corruptions without any noise assumptions or training on noisy data.
- GEPO: Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning
-
Addressing the training instability of LLM reinforcement learning in decentralized environments with high network latency, this paper proposes coarsening importance weight granularity from the token/sequence level to the group level (using group-wise expected probabilities as the denominator). Theoretically, this exponentially reduces importance weight variance against high KL divergence, maintaining performance with only a 3% drop under 1800s latency.
- Getting Your LLMs Ready for Reinforcement Learning with Lightweight SFT
-
This paper reveals that the "optimal SFT checkpoint" and the "best RL starting point" are inconsistent during the RL cold-start phase—models lose RL potential due to distribution forgetting even while evaluation scores are still rising. It proposes using diversity metrics (Entropy / self-BLEU) for early stopping and designs an adaptive weighted loss (AESL) at the token and sub-sequence levels to balance new pattern learning with base model distribution preservation.
- Goal Reaching with Eikonal-Constrained Hierarchical Quasimetric Reinforcement Learning
-
The discrete per-transition local constraints of Quasimetric RL are reformulated into continuous-time Eikonal Partial Differential Equation (PDE) constraints (where the gradient norm is 1). This makes value learning "trajectory-free," requiring only sampled states and goals. A hierarchical structure is integrated to alleviate failures under complex dynamics, achieving SOTA on OGBench navigation tasks.
- Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
-
By integrating a trio of "Scaffolded Data Synthesis + Compiler Feedback-driven Self-Correction + Model Averaging," an open-source Lean theorem prover achieves a new SOTA. The 8B model outperforms the 671B DeepSeek-Prover-V2, and the 32B model reaches a 90.4% pass@32 on MiniF2F with 20x fewer parameters and a significantly lower computational budget.
- GoldenStart: Q-Guided Priors and Entropy Control for Distilling Flow Policies
-
GoldenStart (GS-flow) enhances single-step distilled flow-matching policies by implementing two mechanisms: relocating the generated "starting noise" to high-value regions ("Golden Start") via a Q-guided conditional VAE, and transforming the deterministic actor into a controllable stochastic distribution using entropy regularization. This addresses the challenges of "precise exploitation" and "online exploration" while maintaining single-step inference speed.
- GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning
-
GRACE replaces the black-box neural network reward models in Inverse Reinforcement Learning (IRL) with "executable Python code." It utilizes code LLMs within an evolutionary search to infer readable and verifiable reward functions using only expert trajectories, without requiring task descriptions or ground-truth rewards.
- GRACE: Generative Representation Learning via Contrastive Policy Optimization
-
GRACE reinterprets contrastive learning signals from "losses to be minimized" as "rewards guiding a generative policy." It requires the LLM to first write readable "understanding rationales" for input text before performing mean-pooling on hidden states to obtain embeddings. Using GRPO-style policy gradients to maximize query–positive similarity and minimize query–negative similarity, it significantly improves embedding quality on MTEB while preserving the model's generation and reasoning capabilities.
- Graph-Theoretic Intrinsic Reward: Guiding RL with Effective Resistance
-
The agent's local perception is modeled as a time-evolving graph. The change in Effective Resistance (\(R_{eff}\)) between the "agent node" and "goal node" on this graph serves as a dense intrinsic reward. This provides a theoretically grounded, on-policy guiding signal for sparse reward exploration from a spectral graph theory perspective, without requiring pre-training.
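As a rough illustration of the quantity involved, effective resistance between two nodes can be computed from the Moore-Penrose pseudoinverse of the graph Laplacian: \(R_{eff}(u,v) = L^+_{uu} + L^+_{vv} - 2L^+_{uv}\). The sketch below assumes a small connected, undirected graph given as an adjacency matrix. The function names and the sign convention (a drop in resistance counts as positive reward) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def effective_resistance(adjacency, u, v):
    """Effective resistance between nodes u and v of a connected, undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    l_pinv = np.linalg.pinv(laplacian)  # Moore-Penrose pseudoinverse
    return float(l_pinv[u, u] + l_pinv[v, v] - 2.0 * l_pinv[u, v])

def resistance_reward(adj_before, adj_after, agent, goal):
    """Assumed convention: reward is the decrease in agent-goal resistance."""
    return (effective_resistance(adj_before, agent, goal)
            - effective_resistance(adj_after, agent, goal))

# Path graph 0 - 1 - 2 with unit edge weights, then the same graph plus a 0 - 2 edge.
path = np.array([[0, 1, 0],
                 [1, 0, 1],
                 [0, 1, 0]])
shortcut = np.array([[0, 1, 1],
                     [1, 0, 1],
                     [1, 1, 0]])

r_path = effective_resistance(path, 0, 2)          # two unit resistors in series
r_shortcut = effective_resistance(shortcut, 0, 2)  # series pair in parallel with one edge
reward = resistance_reward(path, shortcut, agent=0, goal=2)
```

In the example, discovering a direct edge between agent and goal lowers the resistance, so the assumed reward is positive.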
- GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks
-
This paper constructs the GraphOmni benchmark framework to systematically evaluate the graph-theoretic reasoning capabilities of 11 LLMs across 241K queries spanning 7 graph types × 7 serialization formats × 9 prompting strategies. It reveals complex interaction effects among these three dimensions and designs an RL-guided combinatorial search method that maintains approximately 90% optimal accuracy at only 25% of the cost.
- GRL-SNAM: Geometric Reinforcement Learning with Differential Hamiltonians for Navigation and Mapping in Unknown Environments
-
The paper reformulates "navigation + mapping" as a Hamiltonian energy optimization problem on the cotangent bundle. Control actions are generated directly from the gradients of a learned energy landscape, replacing the Bellman bootstrapping common in mainstream RL. This allows for high-quality navigation with only local observations and minimal global mapping, while generalizing well to unseen environments.
- Group Verification-based Policy Optimization for Interactive Coding Agents
-
GVPO overlays a "process-verifiable" shaping term onto the group relative advantage of GRPO, directly injecting deterministic intermediate feedback (code execution success/failure) into step-wise advantage. This corrects credit assignment bias caused by sparse outcome rewards, enabling a 32B agent to outperform OpenAI o1 on AppWorld.
- Grouping Nodes with Known Value Differences: A Lossless UCT-based Abstraction Algorithm
-
This paper proposes KVDA-UCT, which relaxes MCTS abstraction from "merging nodes with equal values" to "merging nodes whenever their value difference can be inferred." Without introducing new parameters or sacrificing precision, it discovers significantly more abstractions than the current state-of-the-art OGA-UCT, thereby improving sample efficiency in deterministic environments.
- Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
-
GFP upgrades the dual-policy BRAC framework (Flow Matching BC + 1-step distilled actor) to a value-aware version: it uses a critic and an actor to assign soft scores to dataset actions, ensuring behavior cloning prioritizes high-value actions rather than indiscriminately cloning all state-action pairs. This approach achieves SOTA results across 144 offline RL tasks.
- Guided Policy Optimization under Partial Observability
-
To address the imitation gap that arises when "distilling a teacher trained with privileged information into a student," this paper proposes the GPO framework. A guider (with privileged information) and a learner (with partial observations) are co-trained simultaneously. A "backtracking" constraint keeps pulling the guider back into a range the learner can imitate, which gives a theoretical guarantee that the learner's supervised learning is equivalent to direct RL. Privileged information is thus fully used without producing an "impossibly good teacher" that cannot be imitated.
- Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving
-
The authors propose HELIX, a framework that combines Reinforcement Learning (GRPO) with Evolutionary Algorithms (NSGA-II) for open-ended scientific problem solving. RL iteratively optimizes the policy, evolutionary mechanisms balance solution quality and diversity, and in-context learning utilizes historical solutions to guide exploration. Using only a 14B model, it outperforms GPT-4o pipelines across 20 tasks, including circle packing and machine learning optimization.
- Heterogeneous Agent Q-weighted Policy Optimization
-
HAQO integrates sequential advantage updates, Q-weighted diffusion policies, and entropy regularization into a unified framework. This allows heterogeneous agents to represent multimodal policies using diffusion models while ensuring monotonic improvement of joint returns, similar to trust region methods.
- How Far Can Unsupervised RLVR Scale LLM Training?
-
This paper provides a comprehensive analysis of Unsupervised Reinforcement Learning via Verifiable Rewards (URLVR), revealing that all intrinsic reward methods essentially "sharpen" the model's initial distribution. This leads to an inevitable "rise-then-fall" collapse pattern. The authors propose the Model Collapse Step as a prior metric for model trainability and suggest that external reward methods are the key to breaking the scalability bottleneck.
- How to Lose Inherent Counterfactuality in Reinforcement Learning
-
This paper demonstrates through both theoretical analysis and Atari experiments that standard reinforcement learning naturally learns ordered counterfactual values for non-executed actions, whereas robust training that explicitly pursues \(\epsilon\)-local invariance distorts the Q-function, reshuffles suboptimal actions, causes value overestimation, and forces the policy to lose this counterfactual capability.
- Imitation Learning as Return Distribution Matching
-
This paper reformulates risk-sensitive imitation learning as a "matching the complete return distribution of the expert" problem. It designs two algorithms, RS-BC and RS-KT, with sample complexity guarantees using a class of non-Markovian policies that depend on cumulative returns in tabular MDPs.
- Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
-
SPIN decouples "learning action structure" from "learning control"—first utilizing a BERT-like masked self-supervised objective to pre-train an Action Structure Model (ASM) that characterizes the low-dimensional manifold of valid joint actions, then freezing this representation to train a lightweight policy head. This approach improves average returns by up to 39% and accelerates convergence by up to 12.8x in offline RL across exponentially large discrete combinatorial action spaces.
- Improving Human-AI Coordination through Online Adversarial Training and Generative Models
-
GOAT integrates a frozen cooperative policy generative model (VAE) into an online adversarial training loop, where the adversary searches for "regret-maximizing" partners within the latent space of the generative model. This approach continuously exposes the weaknesses of the cooperative agent without degrading into self-sabotage, achieving SOTA results in human evaluations on Overcooked.
- In-Context Compositional Q-Learning for Offline Reinforcement Learning
-
ICQL reframes Q-learning in offline RL as an "in-context inference" problem—given a query state, it retrieves the top-k similar transitions from the offline dataset and uses a linear Transformer to infer a local Q-function on the fly from this local context. This bypasses the difficulty of fitting a single global Q-network to all sub-tasks, achieving improvements of up to 16.4%, 8.8%, and 6.3% on D4RL Kitchen, MuJoCo, and Adroit, respectively.
- Information-based Value Iteration Networks for Decision Making Under Uncertainty
-
This paper proposes VI2N (Value Iteration with Value of Information Network), which implements the "Pairwise Heuristic" as a differentiable convolutional network module. This enables Value Iteration Networks, for the first time, to learn strategies that "resolve uncertainty before collecting rewards" in partially observable navigation environments with high perceptual ambiguity.
- Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn Search Agents
-
IGPO treats each turn of agent-environment interaction as a process of "approximating the ground truth." It uses the increment in the model's own confidence regarding the ground truth as a turn-level dense reward. This approach mitigates the advantage collapse issue caused by sparse outcome rewards in multi-turn RL without requiring external reward models or Monte Carlo estimation.
- Instance-Dependent Fixed-Budget Pure Exploration in Reinforcement Learning
-
This paper conducts the first study on the fixed-budget pure exploration problem in MDPs and proposes the BREA algorithm. By taking only the interaction budget \(B\) as input, it provides an instance-dependent "\(\epsilon\)-uniform" failure probability upper bound (valid for all precision levels \(\epsilon\) exceeding a budget-related threshold), liberating policy identification from the PAC paradigm that requires pre-specifying precision \(\epsilon\) and confidence \(\delta\).
- Instance-wise Adaptive Scheduling via Derivative-Free Meta-Learning
-
Addressing the issue where Deep Reinforcement Learning (DRL) scheduling models "only optimize average performance and perform sub-optimally on individual instances," this paper utilizes MAML meta-learning to train an initial model "born for fine-tuning." By replacing both inner and outer loop optimizations with derivative-free Evolution Strategies (ES) and leveraging GPU parallelism, the model performs full-parameter adaptive search for each instance during testing, significantly outperforming test-time methods like Active Search and EAS.
- Inter-Agent Relative Representations for Multi-Agent Option Discovery
-
This paper proposes a relative representation focused on inter-agent relationships for joint state abstraction. It first estimates a "Fermat state" that minimizes team-wide alignment costs, then utilizes dimension-wise temporal distances from each agent to this state as a new representation. Upon this representation, Graph Laplacian eigenoption decomposition is performed to discover a smaller number of highly coordinated multi-agent joint options.
- Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?
-
This paper proves that in Exogenous MDPs (Exo-MDP), where uncertainty arises solely from exogenous inputs independent of agent actions, a pure exploitation (no exploration) strategy can achieve sublinear regret. Specifically, the PTO algorithm achieves \(\tilde{O}(H^2|\Xi|\sqrt{K})\) in the tabular case, and for linear function approximation, the LSVI-PE algorithm's regret is polynomially related to feature dimensions and exogenous state space while being independent of the endogenous state/action space size.
- J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
-
J1 unifies "subjective/objective judgment tasks" into a format with verifiable rewards, using GRPO online RL to train LLM judges that "think before rendering a verdict." At a 32B scale, it surpasses o3 and DeepSeek-R1-671B on multiple reward benchmarks and eliminates position bias using purely synthetic data.
- Jackpot: Align Actor-Policy Distribution for Scalable and Stable RL for LLM
-
Jackpot utilizes "Optimal Budgeted Rejection Sampling (OBRS)" to directly align the actor (rollout) distribution with the policy (training) distribution. Combined with Top-K probability estimation and a stabilized Jackpot-PPO loss, it enables stable convergence for LLM reinforcement learning under extreme off-policy settings, including large-batch, asynchronous, and even "disparate model" rollout/training configurations.
- Kevin: Multi-Turn RL for Generating CUDA Kernels
-
This work models the inherently iterative engineering task of "writing GPU kernels" as a multi-turn RL problem. It enables credit assignment within each generation-execution-refinement cycle. Kevin, the first model optimized for CUDA kernels via multi-turn RL, improves accuracy from 56% to 82% and average speedup from 0.53x to 1.10x, surpassing frontier models such as o4-mini.
- KL-Regularized Reinforcement Learning for Generative Modelling is Designed to Mode Collapse
-
This paper proves from a variational inference perspective that diversity collapse in KL-regularized RL is not an optimization failure but an inherent property of the target distribution being constructed as unimodal. Under common hyperparameters, even a perfect global optimum will collapse to a single high-reward mode. Based on this, the authors propose MARA (Mode-Anchored Reward Augmentation), which spreads the target distribution uniformly across all high-reward regions with just two lines of code change.
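The collapse described above can be seen in a toy case using the standard closed form of the KL-regularized optimum, \(\pi^*(y) \propto \pi_{ref}(y)\exp(r(y)/\beta)\). The sketch below uses three hypothetical answers with made-up reference probabilities, rewards, and \(\beta\). It illustrates only the target distribution and does not implement MARA.

```python
import numpy as np

def kl_regularized_target(ref_probs, rewards, beta):
    """Optimal distribution of reward maximization with a KL penalty to ref_probs."""
    logits = np.log(np.asarray(ref_probs, dtype=float)) + np.asarray(rewards, dtype=float) / beta
    logits -= logits.max()  # subtract the max before exponentiating, for stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical setup: two almost equally good answers and one poor answer.
ref_probs = [0.4, 0.4, 0.2]
rewards = [1.0, 0.9, 0.0]
target = kl_regularized_target(ref_probs, rewards, beta=0.02)
print(np.round(target, 4))
```

Even though the first two answers are equally likely under the reference policy and differ in reward by only 0.1, the exact optimum places almost all of its mass on the first one. The concentration comes from the target itself, not from an optimization failure.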
- Koopman-Assisted Trajectory Synthesis: A Data Augmentation Framework for Offline Imitation Learning
-
KATS models expert closed-loop behavior as linear dynamics in a Koopman latent space and synthesizes new trajectories using latent space symmetry transformations that commute with these dynamics. By augmenting these with actions via an inverse dynamics model, KATS significantly improves policy performance in offline imitation learning and few-shot offline reinforcement learning tasks with low data diversity.
- LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection
-
The LadderSym architecture is proposed to solve music practice error detection. By overcoming alignment deficiencies in late fusion via an interleaved cross-stream alignment module (Ladder) and reducing frequency ambiguity in pure audio scores with symbolic score prompting (Sym), it improves the omission F1 from 26.8% to 56.3% on MAESTRO-E.
- LaSeR: Reinforcement Learning with Last-Token Self-Rewarding
-
LaSeR compresses the LLM's judgment of its answer correctness into the log-prob of a special token following the last answer token. By aligning this last-token self-rewarding score to verifier rewards using an MSE auxiliary loss, it simultaneously enhances RLVR reasoning capabilities and test-time self-verification with minimal additional inference cost.
- Latent Wasserstein Adversarial Imitation Learning
-
The authors propose LWAIL, which utilizes ICVF to learn dynamics-aware latent representations from a small amount of random data. By upgrading the "ground metric" of the Wasserstein distance from Euclidean distance to latent space distance, the method achieves expert-level imitation performance using only a single state trajectory.
- Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
-
This paper introduces active learning into RLVR mathematical reasoning training. It identifies that the alignment between "model-perceived difficulty" and "objective error probability" is crucial for training value. By employing offline \(r_{pb}\) and online \(r^{online}_{pb}\) metrics, the method achieves performance close to or exceeding full-data RLVR training using only 30% of queries.
- Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
-
SPEAR employs "curriculum-scheduled self-imitation learning + intrinsic reward shaping" to enable agentic LLMs to explore boldly through tool interactions in early training and exploit successful experiences robustly in later stages. It achieves a progressive exploration-exploitation balance without relying on external expert demonstrations, adhering to the principle of "learning the ropes first, then trusting the results."
- Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
-
This work unifies various RL methods for "compressing long reasoning chains" into a "Length-based Reward Shaping" framework. From this perspective, a step-wise reward LASER and its dynamic, difficulty-aware version LASER-D are proposed. Across five reasoning models (1.5B–32B), these methods simultaneously improve accuracy and token efficiency (e.g., +5.3 accuracy and -64% tokens on AIME24).
- Learning Dynamics Feature Representation via Policy Attention for Dynamic Path Planning in Urban Road Networks
-
To address the dilemma in RL-based dynamic path planning where "global dynamic information is complete but expensive, while local dynamic information is efficient but misses key information," this paper proposes a hierarchical distillation approach. By using "Policy Attention to filter task-related subgraphs + n-hop neighborhoods to extract node-level local features," high-dimensional global dynamics are compressed into compact, approximately Markovian states, improving both speed and quality for any RL backbone.
- Learning from Synthetic Data Improves Multi-hop Reasoning
-
This paper finds that RLVR training on synthetic data generated from completely fictional rules significantly improves LLM performance on real-world multi-hop reasoning tasks (Qwen3-0.6B improves by 56%-131%). This occurs because the model learns the general reasoning skill of knowledge composition rather than memorizing factual knowledge.
- Learning From the Past with Cascading Eligibility Traces
-
This paper generalizes traditional exponentially decaying eligibility traces into cascading eligibility traces (CET) composed of multiple stages connected in series. This allows synaptic memory to peak near a specified delay \(T\), thereby more accurately attributing error signals to past activity in scenarios involving second-scale behavioral feedback and minute-scale retrograde axonal signals.
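A minimal sketch of why a series of stages produces a delayed peak, under the assumption that every stage is a first-order leaky integrator with a shared time constant \(\tau\) (the paper's parameterization may differ). For \(n\) such stages, the impulse response of the last stage peaks at \((n-1)\tau\), so choosing \(\tau = T/(n-1)\) places the peak near \(T\). All names and values below are illustrative.

```python
import numpy as np

def cascaded_trace(n_stages, peak_delay, dt=0.01, horizon=10.0):
    """Euler simulation of n_stages leaky integrators in series, driven by a unit impulse."""
    if n_stages < 2:
        raise ValueError("need at least two stages for a delayed peak")
    tau = peak_delay / (n_stages - 1)  # assumed shared time constant
    steps = int(horizon / dt)
    x = np.zeros(n_stages)
    out = np.zeros(steps)
    for t in range(steps):
        drive = 1.0 / dt if t == 0 else 0.0         # unit-area impulse at time zero
        inputs = np.concatenate(([drive], x[:-1]))  # each stage reads the previous stage
        x = x + (dt / tau) * (inputs - x)
        out[t] = x[-1]
    return np.arange(steps) * dt, out

times, trace = cascaded_trace(n_stages=5, peak_delay=2.0)
peak_time = float(times[np.argmax(trace)])
print(f"trace peaks at t = {peak_time:.2f}")
```

A single exponentially decaying trace, by contrast, is largest immediately after the event, which is why it weights the most recent activity most heavily.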
- Learning Human Habits with Rule-Guided Active Inference
-
This work extends active inference (AIF) into a "habit-forming" framework: it employs a bio-inspired wake–sleep algorithm to jointly learn world models and symbolic rules under a unified free energy objective. This allows agents to react instantaneously using high-confidence rules in familiar contexts while falling back to EFE planning in novel scenarios, resulting in more accurate and faster human behavior prediction and interpretable "habits."
- Learning Massively Multitask World Models for Continuous Control
-
The authors propose MMBench (200 tasks across 10 domains), the first benchmark for "massively multi-task online RL," and Newt, a language-conditioned world model based on TD-MPC2. By following a foundation model paradigm of "pre-training with demonstrations followed by joint online interactive optimization across all tasks," they demonstrate that a single agent can indeed learn hundreds of continuous control tasks simultaneously using online RL.
- Learning to Be Uncertainty: Pre-training World Models with Horizon-Calibrated Uncertainty
-
Addressing the issue where world models are "forced to predict a single deterministic future" during action-free video pre-training, this paper proposes HAUWM. It utilizes an ensemble of dynamics heads with variable horizon prediction and explicitly compels prediction variance to grow monotonically with the prediction horizon via a Horizon-Calibrated Uncertainty (HCU) loss. This allows the model to learn a latent space with "temporal confidence decay awareness," significantly outperforming the SOTA on downstream control tasks.
- Learning to Generate Unit Test via Adversarial Reinforcement Learning
-
The UTRL framework is proposed to iteratively train a unit test generator and a code generator through adversarial RL. The test generator learns to produce discriminative test cases that distinguish LLM-generated code from correct code, while the code generator learns to pass these tests. After training, Qwen3-4B surpasses GPT-4.1 in test generation quality.
- Learning to Orchestrate Agents in Natural Language with the Conductor
-
A 7B Qwen2.5 model is trained as a "Conductor" using GRPO to output complete Agent workflows (subtask instructions + worker assignment + communication topology access lists) in natural language. Coordinating frontier models like GPT-5/Claude Sonnet 4/Gemini 2.5 Pro, it achieves an average of 77.27% across 7 reasoning benchmarks with only 960 questions × 200 training iterations, surpassing all single models (GPT-5 at 74.78%) and multi-agent baselines.
- Learning to Play Multi-Follower Bayesian Stackelberg Games
-
This work provides the first systematic study of the online learning problem in Multi-Follower Bayesian Stackelberg Games (BSG). By employing a geometric partition of the leader's strategy space into "Best Response Regions," the authors achieve a regret bound of \(\tilde{O}(\sqrt{\min\{L, nK\} \cdot T})\) under type feedback. Importantly, this bound does not grow polynomially with the number of followers \(n\). An almost matching lower bound of \(\Omega(\sqrt{\min\{L, nK\}T})\) is also established.
- Learning to Reason as Action Abstractions with Scalable Mid-Training RL
-
This paper provides the first theoretical characterization of "how mid-training shapes post-training RL," pointing out that effective mid-training should occur within temporal action abstractions rather than the raw token space. Based on this, it proposes RA3—a scalable mid-training algorithm that discovers latent reasoning structures via self-supervised RL and feeds them back through SFT.
- Learning to Reason Efficiently with Discounted Reinforcement Learning
-
Verifiable reward reasoning in LLMs is modeled as a finite-horizon stochastic shortest path (SSP) MDP. By applying a discount factor \(\gamma < 1\) only to reasoning tokens, the authors use Blackwell optimality to prove that if \(\gamma\) is sufficiently close to 1, the discounted optimal policy first maximizes accuracy and then selects the shortest trajectory among all correct ones—achieving "lossless reasoning compression."
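The ordering claimed above can be checked on single trajectories with a small sketch. It assumes a binary terminal reward (1 for a correct answer, 0 otherwise) and one factor of \(\gamma\) per reasoning token. The function name and the numbers are hypothetical.

```python
def discounted_return(num_reasoning_tokens, is_correct, gamma=0.999):
    """Return of one trajectory: terminal reward discounted once per reasoning token."""
    terminal_reward = 1.0 if is_correct else 0.0
    return (gamma ** num_reasoning_tokens) * terminal_reward

short_correct = discounted_return(200, True)
long_correct = discounted_return(2000, True)
short_wrong = discounted_return(10, False)

# Any correct trajectory beats any incorrect one, and among
# correct trajectories the shorter one earns the higher return.
print(short_correct, long_correct, short_wrong)
```

With \(\gamma\) close to 1, the penalty for extra tokens stays small relative to the gap between correct and incorrect answers, which is the intuition behind preferring accuracy first and brevity second.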
- Learning What Matters Now: Dynamic Preference Inference under Contextual Shifts
-
This work models preference weights in multi-objective RL (MORL)—traditionally treated as "known constants"—as latent variables that drift with context. It maintains a posterior belief of "what matters now" via online variational inference and jointly trains this with a preference-conditioned actor–critic, enabling agents to rapidly reprioritize goals following event-driven distribution shifts.
- Less is More: Clustered Cross-Covariance Control for Offline RL
-
This paper reveals that the standard mean squared error (MSE) objective in offline RL introduces harmful TD cross-covariance. It proposes the C⁴ (Clustered Cross-Covariance Control for TD) method, which suppresses this effect through partitioned buffer sampling and explicit gradient correction penalties, achieving up to 30% return improvement in small dataset and OOD-dominated scenarios.
- Leveraging Explanation to Improve Generalization of Meta Reinforcement Learning
-
This paper adopts a strategy analogous to "humans reviewing the most relevant previous problems after making mistakes": it first uses example-based explanations to identify the "critical training tasks" most relevant to poorly adapted tasks, then uses conditional mutual information (CMI) to guide the meta-strategy to "pay more attention" to these tasks. By learning an optimal mixup augmentation distribution that encodes more critical task information into the meta-parameters, the method rectifies post hoc the unbalanced generalization in meta-reinforcement learning.
- Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions
-
Inspired by the "goodness function" of the Forward-Forward algorithm, this paper proposes ARQ (Action-conditioned Root mean squared Q-functions)—reading the scalar Q-value directly as the "root mean square after subtracting the mean (i.e., standard deviation)" of the hidden vector output from each cell in local RL. By conditioning the model on the action via one-hot concatenation at the input, it removes the constraint in previous BP-free methods where the "output dimension must equal the number of actions." On MinAtar and DeepMind Control, it outperforms the SOTA local RL method AD and beats BP-trained DQN/SAC on most tasks.
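The readout described above can be sketched in a few lines: the action is concatenated to the state as a one-hot vector, and the Q-value is the standard deviation of the resulting hidden vector. The single ReLU layer, its random weights, and the dimensions are assumptions for illustration only. Local training of the layer is not shown.

```python
import numpy as np

def arq_q_value(hidden):
    """Root mean square of the mean-subtracted hidden vector (its standard deviation)."""
    centered = hidden - hidden.mean()
    return float(np.sqrt(np.mean(centered ** 2)))

def q_values(state, weights, bias, num_actions):
    """Evaluate one Q-value per action by conditioning the input on a one-hot action."""
    values = []
    for action in range(num_actions):
        one_hot = np.eye(num_actions)[action]
        x = np.concatenate([state, one_hot])
        hidden = np.maximum(0.0, weights @ x + bias)  # hypothetical single ReLU layer
        values.append(arq_q_value(hidden))
    return np.array(values)

rng = np.random.default_rng(0)
state_dim, num_actions, hidden_dim = 4, 3, 16
state = rng.normal(size=state_dim)
weights = rng.normal(size=(hidden_dim, state_dim + num_actions))
bias = rng.normal(size=hidden_dim)

qs = q_values(state, weights, bias, num_actions)
greedy_action = int(np.argmax(qs))
```

Because the action enters at the input, the hidden width is free to differ from the number of actions, which is the constraint the summary says earlier BP-free methods had.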
- LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
-
LongRLVR introduces verifiable context rewards into RLVR training to address the vanishing-gradient problem in context grounding caused by outcome-only rewards in long-context scenarios, significantly enhancing the long-context reasoning capabilities of LLMs.
- LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
-
LongWriter-Zero starts from a base model and, without relying on any annotated or synthetic data, acquires ultra-long, high-quality text generation capability solely through GRPO reinforcement learning combined with a three-dimensional composite reward model (Length / Quality / Format). With 32B parameters, it outperforms 100B+ models such as DeepSeek-R1 and Qwen3-235B on WritingBench.
- Look-ahead Reasoning with a Learned Model in Imperfect Information Games
-
This paper proposes LAMIR, which learns an imperfect-information game model with abstraction from interaction trajectories without explicit game rules. This allows the MuZero-style "learn a model then perform look-ahead reasoning" paradigm to operate in large-scale imperfect-information games in a theoretically sound manner for the first time.
- Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
-
LATR replaces independent token-level stochastic sampling in RLVR with a "branching-lookahead simulation-pruning" tree-based rollout. This explicitly generates trajectory-level diversity under a fixed generation budget, accelerating GRPO/DAPO training by 131% and improving final pass@1 by 4.2%.
- LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
-
LoongRL uses synthesized KeyChain data for reinforcement learning to induce the emergence of "plan–retrieve–reason–recheck" patterns in LLMs for long-context reasoning. Models trained solely on 16K contexts generalize to 128K; the 14B model achieves a score of 74.2, nearing the performance of o3-mini (74.5) and DeepSeek-R1 (74.9).
- MAGE: Multi-scale Autoregressive Generation for Offline Reinforcement Learning
-
This work transfers the "Visual Autoregressive (VAR)" paradigm from the image domain to trajectory modeling in offline RL: a coarse-grained global trajectory sketch is generated first, followed by layer-wise autoregressive refinement to fine granularity. This approach simultaneously ensures global coherence and local controllability in long-horizon sparse-reward tasks.
- MARL2Grid-TR: A Multi-Agent RL Benchmark in Power Grid Operations
-
This paper introduces MARL2Grid-TR, the first multi-agent RL benchmark for "topology optimization + redispatching/curtailment" control in realistic transmission grids. Based on the high-fidelity Grid2Op simulation platform from a French TSO, it models power grid control as a multi-agent collaborative task. Experiments demonstrate that mainstream MARL methods fail significantly under realistic constraints, particularly in high-dimensional topology tasks.
- MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
-
MARS-Sep reformulates query-conditioned sound separation as a reinforcement learning problem. It performs stochastic decision-making in the time-frequency domain via a factorized Beta mask policy and utilizes a progressively aligned multimodal encoder to provide semantic reward signals, achieving simultaneous improvements in signal fidelity and semantic consistency.
- Masked Skill Token Training for Hierarchical Off-Dynamics Transfer
-
MSTT abstracts the condition where "structural changes in the environment render certain skills unexecutable" into a binary skill mask. It utilizes VQ-VAE to segment trajectories into discrete skill tokens, trains a "feasibility-aware" critic by simulating dynamics drift with random masks, and employs a diffusion trajectory generator for feasibility filtering. This allows for zero-shot transfer to new environments with structural changes using only a single observation-only demonstration (without action labels) from the target environment.
- Master Skill Learning with Policy-Grounded Synergy of LLM-based Reward Shaping and Exploring
-
PoRSE enables the LLM to not only generate target-oriented rewards but also design an "affordance state space" to drive task-related exploration. Through an online policy improvement process that dynamically weights both components, it sets a new SOTA on 24 robotic manipulation/locomotion tasks and successfully solves two previously intractable complex tasks.
- Mastering Sparse CUDA Generation through Pretrained Models and Deep Reinforcement Learning
-
SparseRL treats pretrained code LLMs as stochastic policies and the compiler+executor as the environment. It utilizes PPO with hierarchical rewards (compilation/correctness/execution efficiency) to end-to-end learn high-performance SpMV/SpMM CUDA code for dynamic sparse matrix inputs, achieving a ~20% increase in compilation rate and an average 30% speedup in generated code.
- MATH-Beyond: A Benchmark for RL to Expand Beyond the Base Model
-
The authors argue that popular mathematical reasoning benchmarks (MATH-500, AIME24) are already almost entirely solved by open-source base models under \(pass@1024\). Consequently, RL fine-tuning merely "sharpens" existing solutions rather than "discovering" new capabilities. To address this, they constructed MATH-Beyond—a set of high school competition problems that \(\le 8B\) open-source models consistently fail to solve even with \(1024\) samples—shifting the evaluation focus from "improving \(pass@k\)" to "expanding the reasoning boundaries of base models."
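For readers unfamiliar with the \(pass@k\) metric used above, here is a minimal sketch of the unbiased estimator commonly used in code-generation evaluation: given \(n\) samples of which \(c\) are correct, it estimates the probability that at least one of \(k\) drawn samples is correct. Whether MATH-Beyond uses exactly this estimator is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a problem that a model "consistently fails to solve even with 1024 samples" corresponds to `pass_at_k(1024, 0, 1024)`, which evaluates to zero.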
- Mean Flow Policy with Instantaneous Velocity Constraint for One-step Action Generation
-
This paper introduces the "mean velocity field" into RL policies, enabling multi-modal optimal action generation from Gaussian noise via one-step sampling. A proposed Instantaneous Velocity Constraint (IVC) addresses the missing boundary conditions to ensure learning accuracy, maximizing training and inference speed while preserving the expressiveness of flow-based policies.
- Menlo: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages
-
Menlo decomposes native-like response quality into four dimensions based on Audience Design theory. It constructs a dataset of 6,423 annotated preference pairs across 47 language variants and demonstrates that an LLM judge trained with pairwise evaluation and Reinforcement Learning (RL) can approach the performance of human annotators.
- MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding
-
MergeMix proposes a mixup data augmentation method based on token merging. It generates mixed images in the attention space through bipartite soft matching and uses the mixing ratio as a soft margin in preference optimization, unifying SFT and RL training paradigms for both image classification and Multi-modal Large Language Models (MLLMs).
- Minimax Optimal Adversarial Reinforcement Learning
-
This paper provides the first proof that sublinear regret remains achievable in episodic MDPs where transition kernels are chosen arbitrarily by an adversary (fully adversarial). It proposes the AD-FTRL algorithm, which reduces regret to \(\tilde{O}(\sqrt{(|S||A|)^K T})\), and establishes minimax optimality by constructing a matching lower bound.
- MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance
-
MIRA amortizes LLM sub-goal decomposition and trajectory priors into a continuously evolving memory graph, from which utility signals are derived to softly shape advantage estimation. This accelerates learning during the early stages of sparse rewards and decays the shaping term over training to preserve PPO convergence—achieving performance close to "per-step LLM querying" methods with only a few dozen offline/online queries.
- MIRACLE: Model-free Imitation and Reinforcement Learning for Adaptive Cut-Selection
-
Treating the Mixed-Integer Programming (MIP) solver SCIP as the environment and its default cut selection heuristic as the expert, this work utilizes GAIL to learn a dense reward function and PPO to train a lightweight cut selection policy. By selecting only a few high-value cuts within a budget per round, the approach compresses peak memory from GB-level to dozens of MBs (up to 98.5% reduction) while achieving 100% solving success rate and an average 3.78× speedup on MIPLIB.
- Mirage or Method? How Model–Task Alignment Induces Divergent RL Conclusions
-
This paper argues that a series of recent "counter-intuitive" reinforcement learning (RL) conclusions for LLMs—such as the effectiveness of spurious rewards, one-shot RL matching full datasets, and sufficiency of pure negative sample training—are not universal laws of RL. Instead, they hold only when the model itself is already proficient in the task (strong model-task alignment, measured by pass@k). Once a task exceeds the model's capabilities, these techniques fail, and only standard RL with correct rewards remains robust.
- Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics
-
By integrating task-specific VAEs, a mixture of Transformer experts, and a shared backbone into a "Mixture-of-World models" (MoW) architecture—augmented with gradient clustering and harmony loss—the authors train a single agent to simultaneously master 26 Atari games and 50 Meta-World tasks. The performance approaches that of an ensemble of 26 single-task models while reducing parameters by half.
- MOBODY: Model-Based Off-Dynamics Offline Reinforcement Learning
-
MOBODY shifts focus in "off-dynamics offline RL" from "filtering/penalizing high-offset source data" to "directly learning an accurate target domain dynamics model for rollout exploration." It employs dual action encoders with shared state/transition functions to learn target dynamics, combined with target Q-weighted behavior cloning for policy optimization, achieving average improvements of 25%–44% on MuJoCo/Adroit.
- ROMI: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting
-
ROMI achieves robust value-aware model learning by transforming the dynamics uncertainty set into a state uncertainty set via Wasserstein duality. It utilizes an implicitly differentiable adaptive weighting mechanism to balance dynamics accuracy and value awareness, effectively solving the Q-value underestimation and gradient explosion issues inherited from RAMBO, and reaches SOTA performance for model-based offline RL on D4RL and NeoRL.
- Model Predictive Adversarial Imitation Learning for Planning from Observation
-
The authors propose MPAIL (Model Predictive Adversarial Imitation Learning), which embeds an MPPI planner into the adversarial imitation learning loop. This represents the first end-to-end Planning-from-Observation framework that significantly outperforms policy-based AIL methods in generalization, robustness, interpretability, and sample efficiency, while successfully deploying in real-world robot navigation from a single observation-only demonstration.
- Multi-Agent Guided Policy Optimization
-
MAGPO utilizes an autoregressive joint "guider" policy for centralized coordinated exploration and constrains it via KL alignment to within the reach of decentralized "learner" policies. This preserves CTDE deployability while providing theoretical guarantees for monotonic policy improvement.
- Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
-
MLES treats Multimodal Large Language Models (MLLMs) as "strategy programmers capable of watching replays." Combined with evolutionary search, it directly generates readable programmatic control policies. By using execution screens (Behavioral Evidence) to diagnose failure modes and perform targeted code modifications, it achieves performance comparable to PPO on Lunar Lander and Car Racing while remaining fully transparent and traceable.
- Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning
-
This paper proposes the MB-AIL (Model-Based Adversarial Imitation Learning) algorithm and establishes horizon-free second-order sample complexity upper bounds under general function approximation. Combined with newly constructed information-theoretic lower bounds on difficult instances, it proves that MB-AIL achieves minimax optimality (up to logarithmic factors) in online interaction sample complexity.
- Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information
-
By linearizing the leader's utility space in Stackelberg games, this paper proposes a reduction framework to linear contextual bandit problems, improving the regret bound from \(\tilde{O}(T^{2/3})\) to a near-optimal \(\tilde{O}(T^{1/2})\) under the bandit feedback setting with side information.
- Negotiated Reasoning: On Provably Addressing Relative Over-Generalization
-
This paper formally defines the "Relative Over-generalization (RO)" problem in MARL for the first time and proves that RO can be avoided if the "consistent reasoning" condition is satisfied. It further proposes SVNR, a negotiated reasoning algorithm based on Stein Variational Gradient Descent, which is the first MARL method capable of provably eliminating RO.
- Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning
-
This paper unifies four seemingly unrelated challenges—robust optimization, global optimization, polynomial root-finding, and sampling—into a "homotopy" paradigm. It demonstrates that their solvers share a "predictor-corrector (PC)" structure and introduces NPC, a universal neural solver that replaces hand-designed step-size and termination heuristics with reinforcement learning to achieve cross-instance generalization and plug-and-play capability.
- Neural+Symbolic Approaches for Interpretable Actor-Critic Reinforcement Learning
-
NSAC replaces the black-box actor in A2C with "additive rule ensembles." It uses a neural network critic for value estimation, while a set of IF-THEN rules directly handles decision-making. Rules are learned online via policy gradients and Orthogonal Gradient Boosting (OGB), achieving performance comparable to black-box methods like DQN, PPO, and A2C while maintaining intrinsic interpretability.
- Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning
-
OC-STORM utilizes frozen video segmentation foundation models (Cutie/SAM2) to extract compact vector features of decision-critical objects from minimal annotations (6–12 frames). By feeding these into a world model, it focuses modeling capacity on small but crucial objects, significantly outperforming the STORM baseline on Atari 100k and visually complex Hollow Knight boss fights, achieving SOTA-level sample efficiency.
- Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning
-
This paper proposes Occupancy Reward Shaping (ORS), which first learns a generative model of "occupancy measures" (future state distributions) using flow matching, and then extracts the world geometry implicit in this model (shortest-path distances from states to goals) via optimal transport into a dense reward. This significantly alleviates the credit assignment challenge in sparse-reward offline GCRL—achieving an average 2.2× improvement across 13 long-horizon tasks, while provably preserving the optimal policy.
- OCTAX: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
-
OCTAX uses JAX to port the 1970s CHIP-8 virtual machine to GPUs for end-to-end vectorized simulation, providing 21 classic arcade games with image observations as RL environments. It achieves 350,000 env-steps/s (1.4 million frames/s) on consumer-grade GPUs, outperforming the CPU-based EnvPool by 14×, and features a pipeline for automated generation of new CHIP-8 environments via LLMs.
- Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
-
This paper proposes COX-Q, an off-policy safe reinforcement learning algorithm. In the online exploration phase, it utilizes Policy-MGDA to resolve gradient conflicts between rewards and costs in the action space and employs an adaptive step size to keep data collection costs within thresholds. In the offline learning phase, it uses Truncated Quantile Critics (TQC) to stabilize cost value estimation and quantify epistemic uncertainty, achieving high sample efficiency while ensuring cost constraints are met during both training and testing phases.
- Offline Preference-based Value Optimization
-
This paper proposes PVO (Preference-based Value Optimization), which directly optimizes the value function using a novel "value alignment loss" to ensure consistency with preference feedback. While achieving the optimal sample complexity of \(O(\varepsilon^{-2})\), it stably outperforms multiple strong baselines on continuous control benchmarks without requiring additional preference learning hyper-parameters.
- Offline Reinforcement Learning with Adaptive Feature Fusion
-
Addressing the issue where Decision Transformer-style "RL as sequence modeling" methods overfit historical sub-optimal sub-trajectories and fail to stitch superior trajectories, this paper proposes QDFFDT. It utilizes a learnable, state-dependent fusion coefficient to adaptively weight and fuse "global sequence features" and "local single-step Markov features," combined with a Q-learning module for value guidance, achieving SOTA on D4RL benchmarks.
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
-
This paper proposes CHORD, which reformulates SFT from an independent training stage into a dynamically weighted auxiliary objective within the on-policy RL process. Using a dual-control mechanism—a "global coefficient \(\mu\) + token-level weighting function \(\phi(\cdot)\)"—it smoothly integrates expert data, consistently outperforming baselines like SFT-then-RL on mathematical reasoning and tool-calling tasks.
- On Discovering Algorithms for Adversarial Imitation Learning
-
This paper proposes DAIL, the first meta-learning algorithm for Adversarial Imitation Learning (AIL). It decomposes AIL into two stages: density ratio estimation and reward assignment (RA). Using LLM-guided evolutionary search, it automatically discovers the optimal RA function \(r_{\text{disc}}\), which generalizes across unseen environments and policy optimizers while outperforming all human-designed baselines.
- On Predictability of Reinforcement Learning Dynamics for Large Language Models
-
This paper discovers that the parameter update matrix \(\Delta W\) of LLMs during RL training is almost entirely dominated by its Rank-1 subspace (a single direction can recover over 99% of reasoning gains). Furthermore, this subspace evolves approximately linearly during training and can be extrapolated from early checkpoints. Based on these findings, the authors propose AlphaRL, a parameter-free acceleration framework that extrapolates final updates using the first 40% of training steps, achieving up to 2.5× acceleration while preserving \(>96\%\) of reasoning performance.
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification
-
This work mathematically proves from an RL policy gradient perspective that the SFT gradient implicitly encodes a pathological reward structure of inverse probability weighting (\(1/\pi_\theta\)). This causes excessively large gradients for low-probability tokens, which limits generalization. The authors propose DFT (Dynamic Fine-Tuning), which eliminates this weighting via a one-line code modification (multiplying CE loss by the token probability: \(-p\log p\)). DFT significantly outperforms SFT in mathematical reasoning, code generation, and multimodal tasks, and even surpasses GRPO/PPO in offline RL settings.
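The effect of the reweighting can be illustrated per token. The sketch below compares the cross-entropy loss \(-\log p\) with the rescaled loss \(-p\log p\), and the size of the gradient with respect to \(p\) in each case. It assumes the multiplying probability is treated as a constant weight (no gradient flows through it); that treatment, and all function names, are assumptions made for illustration rather than details stated in the note.

```python
import math


def ce_loss(p):
    """Standard per-token cross-entropy loss: -log p."""
    return -math.log(p)


def dft_loss(p):
    """Cross-entropy multiplied by the token probability: -p log p."""
    return p * ce_loss(p)


def sft_grad_scale(p):
    """|d(-log p)/dp| = 1/p, which blows up for low-probability tokens."""
    return 1.0 / p


def dft_grad_scale(p):
    """With the weight p held constant (assumption), the 1/p factor cancels."""
    return p * sft_grad_scale(p)
```

For a token with probability 0.01, the SFT gradient scale is about 100 while the rescaled one stays near 1, which is the inverse-probability weighting the summary refers to.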
- On the \(O(1/T)\) Convergence of Alternating Gradient Descent-Ascent in Bilinear Games
-
This paper provides the first proof that Alternating Gradient Descent-Ascent (AltGDA) converges to the Nash Equilibrium (NE) at an \(O(1/T)\) rate in constrained bilinear zero-sum games (when an interior NE exists). This is faster than the \(O(1/\sqrt{T})\) rate of Simultaneous GDA. The study characterizes the "friction" effect during boundary collisions using energy function decay and further optimizes step sizes via Performance Estimation Programming (PEP).
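The difference between the two update orders can be seen on a toy problem. The sketch below runs both schemes on the unconstrained scalar game \(f(x, y) = xy\) (minimize over \(x\), maximize over \(y\)), whose equilibrium is the origin. This is a simplification chosen for illustration: it omits the constraint sets, projections, and rate analysis that the paper studies, and the function names are hypothetical.

```python
import math


def simultaneous_gda(x, y, eta, steps):
    """Both players update from the same old iterate."""
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y


def alternating_gda(x, y, eta, steps):
    """x moves first; y then responds to the already-updated x."""
    for _ in range(steps):
        x = x - eta * y
        y = y + eta * x
    return x, y


def norm(point):
    """Euclidean distance of an (x, y) pair from the equilibrium at the origin."""
    return math.hypot(point[0], point[1])
```

In this toy game the simultaneous iterates spiral outward, since each step multiplies the squared norm by \(1 + \eta^2\), while the alternating iterates remain on a bounded orbit because the update conserves \(x^2 + y^2 - \eta xy\).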
- On the Tension Between Optimality and Adversarial Robustness in Policy Optimization
-
This paper reveals from an optimization perspective that although "optimal policies" and "robust optimal policies" can theoretically align, standard policy optimization (SPO) and adversarial robust policy optimization (ARPO) converge to different first-order stationary policies (FOSPs) in practice, creating a tension between "robustness vs. natural return." The root cause is that the strongest adversary reshapes the optimization landscape into a rugged terrain, creating numerous "sticky" suboptimal stable points. Accordingly, the authors propose a bilevel framework, BARPO, which smooths the landscape by modulating adversarial intensity, achieving both high natural rewards and strong robustness on MuJoCo.
- One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline RL
-
This paper transforms Diffusion Q-Learning (DQL)—the strongest but slowest and most fragile approach in offline RL—from a DDPM multi-step denoising process into a Flow Matching framework. By replacing marginal velocity with an "average velocity field," the policy generates actions in one single step during both training and inference. This achieves significant acceleration and outperforms multi-step DQL on D4RL, reaching a new SOTA.
- One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
-
This paper proposes ONELIFE, which enables an agent to run a single unguided episode in a complex, dangerous, and stochastic open world and infer the environment's transition dynamics \(p(s_{t+1}\mid s_t,a_t)\) as a set of executable probabilistic "law" programs from observations alone. By utilizing a "precondition-effect" structure to construct on-demand dynamic computation graphs, the method backpropagates gradients only to truly relevant laws, outperforming the strong baseline PoE-World in 16 out of 23 mechanisms in Crafter-OO.
- One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning
-
ScaleZero addresses gradient conflict and plasticity collapse in multi-task learning by introducing a Mixture-of-Experts (MoE) architecture into a unified world model. Combined with a Dynamic Parameter Scaling (DPS) strategy for adaptive model capacity allocation, a single multi-task model achieves performance comparable to single-task expert models across Atari, DMC, and Jericho benchmarks while reducing environmental interactions by approximately 28.5%.
- Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits
-
This work formalizes the problem of minimizing polarization and disagreement under the Friedkin-Johnsen opinion dynamics model as an online low-rank matrix bandit problem (OPD-Min) for the first time. It proposes a two-stage algorithm, OPD-Min-ESTR, which reduces the dimensionality from \(|V|^2\) to \(O(|V|)\) through subspace estimation, significantly outperforming full-dimensional linear bandit baselines on both synthetic and real-world networks.
- Online Prediction of Stochastic Sequences with High Probability Regret Bounds
-
This work revisits the classic problem of universal prediction for stochastic sequences under a finite time horizon \(T\). It provides the first vanishing regret bound that holds with high probability (in the form of \(O(T^{-1/2}\delta^{-1/2})\)), which is highly consistent with existing expected regret bounds of \(O(T^{-1/2})\). Furthermore, it proves that the exponent of \(\delta\) cannot be improved without additional assumptions.
- Operator Theory-Driven Autoformulation of MDPs for Control of Queueing Systems
-
This paper utilizes Large Language Models (LLMs) to automatically translate natural language descriptions of queueing control problems into Bellman equations in the form of "operator graphs." By leveraging a rigorously proven "universal three-layer topology" to prune the vast modeling search space, it employs a customized MCTS for graph construction and low-complexity dynamic programming to automatically identify the structure of optimal policies (e.g., threshold/monotone types). On a self-constructed dataset of 36 problems, the modeling accuracy was improved from single digits in baselines to 83.3%.
- OPRIDE: Efficient Offline Preference Reinforcement Learning via In-Dataset Exploration
-
OPRIDE addresses the high cost of human feedback in Offline Preference Reinforcement Learning (PbRL) by proposing Difference-of-Value-Differences to select the most informative preference queries and Variance-driven Discount Scheduling to suppress over-optimization of learned rewards. It significantly outperforms previous SOTA on Meta-World and AntMaze using only approximately 10 preference queries.
- Optimal Robust Subsidy Policies for Irrational Agent in Principal-Agent MDPs
-
This paper investigates how a principal can design subsidies within an MDP framework to guide a potentially partially irrational agent. It proves that when the agent is "globally \(\epsilon\)-incentive compatible," the seemingly complex bi-level minimax problem can be equivalently reduced to one-dimensional concave optimization. Conversely, when incentive compatibility constraints are refined to a "per-state" basis, the problem either leads to non-Markovian policies or becomes NP-hard.
- Optimistic Task Inference for Behavior Foundation Models
-
This paper proposes OpTI-BFM, a method for Behavior Foundation Models that requires neither a complete reward function nor labeled datasets at test time. Instead, it infers tasks and recovers oracle performance within just 5 episodes of environment interaction. The core idea leverages the linear structure of successor features to reduce task inference to a linear bandit problem, utilizing a UCB strategy for optimistic exploration with a formal regret bound.
- P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
-
This paper proposes P-GenRM, the first personalized generative reward model. Three-stage training (PSI supervised fine-tuning to build structured evaluation chains → CRE reinforcement learning to enhance reasoning under missing preferences → hard negative curriculum learning to improve robustness) transforms mixed preference signals into scenario-adaptive user personas and scoring rubrics. By introducing dual-granularity test-time scaling (individual-level multi-sampling aggregation + prototype-level collaborative filtering leveraging similar user preferences), it outperforms the previous SOTA by 2.31% on PersonalRewardBench, achieves an additional 3% gain via test-time scaling, and generalizes to unseen users.
- PAMDP: Interact to Persona Alignment via a Partially Observable Markov Decision Process
-
This paper models "gradual alignment to user persona during multi-turn interaction" as a Partially Observable Markov Decision Process (PAMDP) where user profiles are unobservable. It utilizes a lightweight Actor with continuous latent space actions and a "partial state + full state" dual Critic for unbiased advantage estimation, achieving higher alignment win rates and cumulative rewards on both offline datasets and online simulators.
- Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
-
Parallel-R1 proposes the first framework to inject "parallel thinking" capabilities into real-world mathematical reasoning tasks via reinforcement learning (RL) rather than pure SFT. A progressive curriculum ("cold-start data generation via simple task prompts → SFT for format learning → simple-task RL for format stabilization → difficult-task RL for performance enhancement") lets the framework bypass cold-start challenges. Combined with alternating rewards, it outperforms sequential RL baselines by an average of 8.4% on AIME/AMC/MATH. Furthermore, it identifies that parallel thinking acts as a "mid-training exploration scaffold," yielding gains of up to 42.9%.
- Parameter-Efficient Reinforcement Learning using Prefix Optimization
-
This paper proposes optimizing only the first \(k\) tokens (the prefix) of a response while delegating the subsequent generation to a frozen reference model. This demonstrates that a significant portion of RLVR gains in mathematical reasoning stems from "selecting a better problem-solving strategy/format." Based on this, a computationally efficient method, Prefix-RL, is derived: a 1B adapter generates prefixes that guide 7B to 72B models. Training only the adapter improves Qwen-7B from 67.4% to 74.4% on MATH-500.
- Peak-Return Greedy Slicing: Subtrajectory Selection for Transformer-based Offline RL
-
PRGS introduces a pre-processing step for Transformer-based offline RL that "picks high-quality fragments at the timestep level." It utilizes an MMD return estimator to calculate optimistic future return distributions for each state-action pair, greedily slices trajectories into "peak-return subtrajectories" for training, and adaptively truncates history during evaluation. This approach achieves an average improvement of 15.8% across multiple benchmarks including D4RL, BabyAI, and AuctionNet.
- Peng's Q(\(\lambda\)) for Conservative Value Estimation in Offline Reinforcement Learning
-
CPQL introduces the multi-step operator Peng's Q(\(\lambda\)) from online RL into offline RL for the first time, replacing the single-step Bellman operator in CQL for conservative value estimation. By leveraging the property that the PQL fixed point naturally aligns with the behavioral policy value, it mitigates over-pessimism, consistently outperforms various single-step baselines on D4RL, and enables seamless offline-to-online finetuning.
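For reference, the multi-step targets of Peng's Q(\(\lambda\)) can be computed with a backward recursion, \(G_t = r_t + \gamma\,[(1-\lambda)\max_a Q(s_{t+1}, a) + \lambda G_{t+1}]\). The sketch below implements only this textbook recursion on a single truncated trajectory; it does not include CPQL's conservative regularization, and the function name and argument layout are assumptions made for illustration.

```python
def peng_q_lambda_targets(rewards, next_max_q, gamma, lam):
    """Backward recursion for Peng's Q(lambda) targets on one trajectory.

    rewards[t] is r_t and next_max_q[t] is max_a Q(s_{t+1}, a).
    The trajectory is assumed truncated, so the last step bootstraps
    from next_max_q[-1]; pass 0.0 there for a terminal state.
    """
    targets = [0.0] * len(rewards)
    g_next = next_max_q[-1]  # stands in for the return beyond the horizon
    for t in reversed(range(len(rewards))):
        mixed = (1.0 - lam) * next_max_q[t] + lam * g_next
        targets[t] = rewards[t] + gamma * mixed
        g_next = targets[t]
    return targets
```

Setting `lam=0` recovers the usual one-step targets \(r_t + \gamma \max_a Q(s_{t+1}, a)\), while `lam=1` gives the discounted return along the trajectory with a single bootstrap at the end.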
- PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
-
This paper introduces Reinforcement Learning (RL) to the Conditional Semantic Textual Similarity (C-STS) task for the first time. It proposes PoLi-RL, a two-stage curriculum RL framework that progresses from "point-to-list," along with a Parallel Sliced Ranking Reward (PSRR) mechanism that decomposes coarse batch-level ranking signals into precise rewards for each individual completion. An 8B model trained with this framework achieves a Spearman correlation of 48.18 on the official C-STS benchmark, surpassing GPT-4o and DeepSeek-R1 to set a new Cross-encoder SOTA.
- Policy Likelihood-based Query Sampling and Critic-Exploited Reset for Efficient Preference-based Reinforcement Learning
-
PoLiCER addresses two persistent issues in preference-based RL—the decoupling of queries from the current policy and the reward estimator's overfitting to early feedback. It proposes a combination of "query sampling ranked by policy likelihood" and "reward/Q-network resets triggered by critic outputs." The method consistently outperforms existing baselines such as PEBBLE and QPA on locomotion and robotic arm tasks in DMControl and Meta-World.
- Policy Newton Algorithm in Reproducing Kernel Hilbert Space
-
This paper proposes the first second-order policy optimization method in RKHS, dubbed Policy Newton in RKHS. By optimizing a cubic-regularized auxiliary objective, the method bypasses the infinite-dimensional Hessian inversion. Leveraging the Representer Theorem, the infinite-dimensional optimization is equivalently transformed into a finite-dimensional problem where the dimension scales with the trajectory data \(NT\). The authors theoretically prove convergence to local optima with a local quadratic convergence rate. Empirically, the method demonstrates faster convergence and higher returns compared to first-order RKHS methods and parametric second-order methods.
- PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning
-
This paper proposes PolicyFlow, which seamlessly integrates Continuous Normalizing Flow (CNF) policies into the PPO framework. By approximating importance ratios through velocity field changes along an interpolation path (avoiding backpropagation through the full ODE path) and introducing an implicit entropy regularizer inspired by Brownian motion to prevent mode collapse, PolicyFlow achieves or exceeds the performance of Gaussian PPO and flow-based baselines (FPO/DPPO) in environments such as MultiGoal, PointMaze, IsaacLab, and MuJoCo.
- Polychromic Objectives for Reinforcement Learning
-
To address the problem where Reinforcement Learning Fine-tuning (RLFT) tends to collapse the policy into a few high-reward behaviors and discard the diversity of the pre-trained model, this paper proposes "polychromic objectives." This method couples reward with diversity, assigning high scores only to sets of trajectories that are both "successful and diverse." By integrating vine sampling and set-shared advantages into PPO (Polychromic PPO), the method achieves higher success rates, greater pass@k coverage, and stronger perturbation robustness across BabyAI, Minigrid, and Algorithmic Creativity tasks.
- Post-training Large Language Models for Diverse High-Quality Responses
-
The authors propose DQO (Diversity Quality Optimization), which defines a diversity metric in the semantic embedding space based on Determinantal Point Processes (DPP). By jointly optimizing this metric with reward signals, LLM post-training improves both semantic diversity and response quality. DQO can be integrated on top of GRPO/PPO.
- Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning
-
This paper proposes POW (Potentially Optimal Joint Actions Weighting), which uses an explicit joint-action-conditioned recognition module \(Q_r\) to iteratively "identify" a set of potentially optimal joint actions and assign them higher training weights. This theoretically guarantees the recovery of the true optimal policy, bridging the gap between the "theoretical promise" and "heuristic approximation" of the WQMIX series. It consistently outperforms value-based SOTA in tasks such as Matrix Games, Predator-Prey, SMAC/SMACv2, and highway-env.
- Predictive CVaR Q-Learning
-
This paper proposes Predictive CVaR Q-learning (PCVaR-Q), which reformulates the CVaR objective—originally only evaluable at the end of a trajectory—into a step-by-step recursive Bellman form by introducing a pair of "predictive tail value/probability functions." Combined with a "bi-directional exploration" strategy that simultaneously explores actions and risk budgets, it significantly improves the sample efficiency and training stability of risk-sensitive RL, approaching the CVaR-optimal policy in both decision trees and stochastic grid worlds.
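For readers unfamiliar with the objective, the sketch below computes an empirical CVaR from a batch of episode returns. It only illustrates the quantity being optimized, not the paper's recursive formulation. The function name `empirical_cvar` and the lower-tail convention (mean of the worst `alpha` fraction of returns) are assumptions made for illustration.

```python
import math

def empirical_cvar(returns, alpha):
    """Mean of the worst `alpha` fraction of returns (lower-tail CVaR)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not returns:
        raise ValueError("returns must be non-empty")
    ordered = sorted(returns)                       # worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))     # size of the tail
    return sum(ordered[:k]) / k

episode_returns = [-4.0, 1.0, 2.0, 5.0]
tail_value = empirical_cvar(episode_returns, alpha=0.5)   # averages -4.0 and 1.0
mean_value = empirical_cvar(episode_returns, alpha=1.0)   # reduces to the plain mean
```

The whole-trajectory sort is what makes this form usable only once an episode has finished, which is the limitation the step-by-step reformulation is meant to remove.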
- Preference-based Policy Optimization from Sparse-reward Offline Dataset
-
PREFORL reformulates sparse-reward offline RL as a contrastive preference learning problem. By bypassing value function estimation and contrasting successful trajectories against both "in-dataset failures" and "synthesized out-of-distribution failures," it suppresses value overestimation and enhances robustness. It consistently outperforms SOTA methods like CQL, IQL, CPL, and ReBRAC on sparse-reward benchmarks including Adroit, Sparse-MuJoCo, Maze2D, and MetaWorld.
- PreferThinker: Reasoning-based Personalized Image Preference Assessment
-
The paper proposes PreferThinker, which connects diverse users through universal visual preference profiles and adopts a "predict-then-assess" CoT reasoning paradigm for interpretable personalized image preference assessment. Combined with cold-start SFT + GRPO reinforcement learning and a similarity-aware prediction reward, the 7B model outperforms GPT-4o (+5.2%) and Claude 3.7 (+5.1%).
- Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses
-
This paper proposes the first primal-dual policy optimization algorithm for finite-horizon linear CMDPs with adversarial losses and stochastic costs. By employing a novel "weighted LogSumExp softmax policy" combined with periodic policy mixing and regularized dual updates, it achieves sublinear regret and constraint violation (both \(\tilde{O}(K^{3/4})\)) while controlling the covering number of the policy class and the dual variables.
- Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning
-
Inspired by the hippocampal-cortical interaction mechanism in the human brain, this paper proposes the FAME dual-learner framework. It achieves efficient continual reinforcement learning by employing a fast learner for knowledge transfer and a meta learner for knowledge integration, while minimizing catastrophic forgetting in a principled manner.
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
-
Addressing the fundamental contradiction where diffusion large language models (dLLMs) generate non-autoregressively and lack token-level conditional probabilities (rendering GRPO directly inapplicable), this paper proposes ESPO. By treating "generating the entire sequence" as an atomic action and using ELBO as a computable proxy for sequence log-likelihood, combined with length-normalized importance ratios and a k2 KL estimator for stability, ESPO significantly outperforms token-level RL baselines on math, code, and planning tasks (gains of 20–40 or even 60+ points on Countdown/Sudoku).
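The two stabilizers named above can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: `elbo_new` and `elbo_old` are hypothetical scalar ELBO estimates of the sequence log-likelihood under the current and old policies, and applying the k2 estimator to the length-normalized log-ratio is a choice made here for the example.

```python
import math

def sequence_ratio(elbo_new, elbo_old, length):
    """Length-normalized importance ratio for a whole sequence."""
    if length <= 0:
        raise ValueError("length must be positive")
    return math.exp((elbo_new - elbo_old) / length)

def k2_kl(log_ratio):
    """k2 estimator of KL divergence: half the squared log-ratio."""
    return 0.5 * log_ratio ** 2

# A 100-token sequence whose ELBO improved by 10 nats in total.
ratio = sequence_ratio(elbo_new=-90.0, elbo_old=-100.0, length=100)
penalty = k2_kl(math.log(ratio))
```

Dividing by the length keeps the ratio near 1 for long sequences; without it, the same 10-nat gap would give a ratio of `exp(10)`.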
- Probing in the Dark: State Entropy Maximization for POMDPs
-
Addressing the POMDP challenge where true states are unobservable, this paper proposes maximizing the entropy of a predictive latent as a proxy objective. It introduces the LatEnt algorithm, which concurrently learns latent dynamics and policies. On the custom PROBE benchmark, it induces true state entropy close to the "oracle" view, enabling downstream PPO to solve sparse-reward tasks that are unlearnable from scratch.
- Prompt Curriculum Learning for Efficient LLM Post-Training
-
This paper systematically investigates how "batch size" and "prompt difficulty" jointly affect convergence during the RL post-training of LLMs. It discovers the existence of an optimal batch size and identifies that medium-difficulty prompts (with a success rate of approximately 50%) are the most efficient. Based on these findings, the authors propose PCL, a lightweight algorithm that employs an online-learned value model to predict prompt difficulty in a single forward pass to filter for medium-difficulty prompts. On mathematical reasoning benchmarks, PCL either achieves state-of-the-art performance or significantly reduces training time, with prompt filtering being 12.1×–16.9× faster than rollout-based methods.
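The filtering step can be sketched as follows. This is a minimal illustration rather than the PCL implementation: `predicted_success` stands in for the value model's per-prompt output, and `select_medium_prompts` and its `target` parameter are hypothetical names.

```python
def select_medium_prompts(prompts, predicted_success, k, target=0.5):
    """Return the k prompts whose predicted success rate is closest to `target`."""
    if len(prompts) != len(predicted_success):
        raise ValueError("one prediction per prompt is required")
    ranked = sorted(
        zip(prompts, predicted_success),
        key=lambda pair: abs(pair[1] - target),   # distance from medium difficulty
    )
    return [prompt for prompt, _ in ranked[:k]]

candidates = ["p0", "p1", "p2", "p3", "p4"]
scores = [0.05, 0.45, 0.98, 0.60, 0.52]   # hypothetical value-model predictions
batch = select_medium_prompts(candidates, scores, k=2)
```

Because the scores come from a single forward pass, no rollouts are needed to decide which prompts enter the batch.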
- PROS: Towards Compute-Efficient RLVR via Rollout Prefix Reuse
-
PROS identifies that multiple rollouts for the same query are highly redundant in early reasoning steps. It constructs "Augmented Queries" by concatenating original queries with "valuable prefixes" from historical rollouts for reuse in subsequent iterations. This eliminates redundant compute and employs a hierarchical Bayesian model to estimate pass rates, prioritizing samples with pass rates near 0.5. PROS achieves higher accuracy than PPO/GRPO on AIME24/AMC23 with less wall-clock time.
- Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
-
Addressing the issue where severe staleness of rollout data in asynchronous RL training for LLMs leads to performance degradation or training collapse, this paper first reveals the "Prosperity before Collapse" phenomenon—stale data is as informative as on-policy data, and the key lies in its utilization. The authors propose M2PO, which uses the second moment \(M_2\) of importance weights instead of \(\epsilon\)-clipping to constrain the trust region. By masking only extreme outlier tokens and retaining most useful updates, M2PO stabilizes training even with data stale by 256 updates, matching on-policy performance across six models ranging from 1.7B to 32B.
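One way to picture a second-moment trust region is the sketch below. It is an assumption-laden illustration, not the M2PO algorithm itself: it takes \(M_2\) to be the mean squared log importance ratio over kept tokens, and it greedily masks the most extreme tokens until that mean falls below a hypothetical `threshold`.

```python
def m2_mask(log_ratios, threshold):
    """Keep-mask such that mean(log_ratio^2) over kept tokens <= threshold."""
    keep = [True] * len(log_ratios)
    # Visit tokens from the most extreme log-ratio to the least extreme.
    order = sorted(range(len(log_ratios)),
                   key=lambda i: log_ratios[i] ** 2, reverse=True)
    total = sum(x * x for x in log_ratios)
    count = len(log_ratios)
    for i in order:
        if count == 0 or total / count <= threshold:
            break                       # constraint satisfied, stop masking
        keep[i] = False                 # drop the current worst outlier
        total -= log_ratios[i] ** 2
        count -= 1
    return keep

token_log_ratios = [0.01, -0.02, 1.5, 0.03, -0.9]
mask = m2_mask(token_log_ratios, threshold=0.04)
```

Unlike per-token clipping, only the two outliers are masked here; the three well-behaved tokens keep their full gradient contribution.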
- Proximal Supervised Fine-Tuning
-
PSFT reinterprets standard SFT as "policy gradient with strictly positive advantages" and borrows the clipped trust region mechanism from PPO to impose a soft constraint on SFT updates. This preserves target task performance while significantly mitigating entropy collapse, maintaining general capabilities, and providing greater optimization space for subsequent RL/DPO stages.
- Q-Learning with Adjoint Matching
-
QAM incorporates adjoint matching techniques from generative modeling into Q-learning. It uses gradients from the critic on "clean actions" as direct step-by-step supervision to fine-tune multi-step flow policies. This approach preserves the expressivity of flow policies while avoiding numerical instability from backpropagating through denoising chains, achieving an aggregate score of 44/46 across 50 sparse-reward tasks in OGBench and surpassing all existing baselines.
- Q-Learning with Fine-Grained Gap-Dependent Regret
-
Focusing on model-free RL for episodic tabular MDPs, this paper proposes a fine-grained analysis framework that "separately counts visits for optimal and sub-optimal state-action pairs." It provides the first fine-grained gap-dependent regret upper bound involving individual gaps \(\Delta_h(s,a)\) for UCB-Hoeffding and subsequently fixes two defects in the truncation and martingale difference conditions of AMB—the only prior non-UCB algorithm—introducing two improved versions: ULCB-Hoeffding and Refined AMB.
- Q-learning with Posterior Sampling
-
This paper proposes PSQL—the first Q-learning algorithm that uses "maintaining Gaussian posteriors over Q-values and taking argmax after sampling" for exploration. By modifying the target value calculation to "optimistic multi-sample sampling," the authors prove a near-optimal regret upper bound of \(\tilde{O}(H^2\sqrt{SAT})\) for this natural posterior-sampling-style Q-learning.
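The exploration rule can be sketched as below. It is a simplified illustration, not PSQL itself: the posterior standard deviation `scale / sqrt(count + 1)` is an assumed schedule, the optimistic multi-sample target is omitted, and `sample_greedy_action` is a hypothetical name.

```python
import math
import random

def sample_greedy_action(q_means, visit_counts, rng, scale=1.0):
    """Draw one sample per action from a Gaussian posterior, then take argmax."""
    samples = [
        rng.gauss(mean, scale / math.sqrt(count + 1))   # narrower with more visits
        for mean, count in zip(q_means, visit_counts)
    ]
    return max(range(len(samples)), key=lambda a: samples[a])

rng = random.Random(0)
action = sample_greedy_action([0.0, 1.0, 0.5], [3, 3, 3], rng)
```

With few visits the samples are spread widely, so lower-mean actions are tried often; as counts grow, the choice concentrates on the action with the highest posterior mean.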
- QeRL: Quantization-enhanced Low-rank Reinforcement Learning for LLMs
-
QeRL combines NVFP4 quantization with LoRA to train the reasoning capabilities of LLMs. It unexpectedly discovers that quantization noise can increase policy entropy and enhance RL exploration. By incorporating a schedulable Adaptive Quantization Noise (AQN) mechanism, 4-bit models achieve higher accuracy in mathematical reasoning than 16-bit LoRA while being significantly faster (1.5× rollout speedup, 1.8× end-to-end). This work also marks the first time RL for a 32B model has been successfully executed on a single H100 80GB GPU.
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
-
To address the issue of sparse rewards and learning difficulties in RLVR on hard problems, QuestA prepends "partial solutions" to difficult questions during training as hints to reduce difficulty and densify reward signals. Combined with a curriculum that reduces the hint proportion from 50% to 25%, a 1.5B small model achieves new SOTA results on mathematical competition benchmarks such as AIME24/25 and HMMT25 (AIME24 72.5%, AIME25 62.3%).
- QuRL: Low-Precision Reinforcement Learning for Efficient Reasoning
-
QuRL eliminates the 70% training time bottleneck in RLVR by using a quantized actor for rollout decoding. By introducing Adaptive Clipping Range (ACR) and Update-Aware Quantization (UAQ), it stabilizes the off-policy bias introduced by quantization, achieving 20%–80% speedup in INT8/FP8 rollout with almost no performance degradation.
- QuRL: Rubrics As Judge For Open-Ended Question Answering
-
QuRL transforms the challenge of "lacking gold standard answers" in open-ended QA into a task of automatically mining case-wise rubrics from web articles to serve as verifiable rewards. Using the GRPO training strategy, it improves Qwen2.5-7B by an average of +17.0 points compared to the SFT baseline.
- R-Zero: Self-Evolving Reasoning LLM from Zero Data
-
R-Zero initializes two roles, a "Challenger" and a "Solver," from a single base model. The Challenger is rewarded for generating difficult problems at the edge of the Solver's capability (accuracy \(\approx 50\%\)), while the Solver is rewarded for solving them. The two are trained alternately using GRPO in a co-evolutionary process. Without any human-authored questions or labels, this method improves the mathematical reasoning average of Qwen3-4B-Base by \(+6.49\) and general reasoning by \(+7.54\).
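A reward that peaks when the Solver is right about half the time can be sketched as follows. The exact form used here, \(1 - 2\,|p - 0.5|\) with \(p\) estimated as the share of sampled answers that agree with the majority answer, is an assumption for illustration, as is the name `challenger_reward`.

```python
from collections import Counter

def challenger_reward(solver_answers):
    """Reward in [0, 1] that is highest when majority agreement is 50%."""
    if not solver_answers:
        raise ValueError("at least one solver answer is required")
    # Without labels, use the majority answer as a pseudo-label.
    _, majority_count = Counter(solver_answers).most_common(1)[0]
    p = majority_count / len(solver_answers)
    return 1.0 - 2.0 * abs(p - 0.5)

hard_but_solvable = challenger_reward(["4", "4", "5", "6"])   # agreement 0.5
too_easy = challenger_reward(["4", "4", "4", "4"])            # agreement 1.0
```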
- R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
-
This paper employs SFT cold starting and multi-stage curriculum GRPO to train open-source LLMs into general-purpose Code Interpreters that autonomously decide when to write code versus performing text-based reasoning. The key innovation is sorting samples for curriculum learning based on "Improvement Potential" rather than task difficulty. This approach increases the average RL gain from +3.4% to +9.3% across 144 heterogeneous tasks. Ultimately, R1-CI-14B improves accuracy from 44.1% to 72.4% across 37 test tasks, surpassing GPT-4o (including its official Code Interpreter).
- R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
-
This paper reformulates the task of "judging which of two multimodal responses is better" as a rule-based RL task. To address the training collapse issues when directly applying Reinforce++, the authors propose the StableReinforce algorithm (Pre-CLIP + Advantage Filtering + Consistency Reward + Progressive Difficulty Cold Start). They trained a 7B reward model, R1-Reward, which improves upon previous SOTA by approximately 3.5%/13.5%/14.6% across three multimodal reward benchmarks and demonstrates further performance gains as sampling iterations increase.
- R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation
-
Based on the DreamerV3 framework, R2-Dreamer replaces the "reconstruction decoder" with a Barlow Twins-inspired redundancy reduction self-supervised objective. It prevents representation collapse without decoders or data augmentation, performing on par with DreamerV3/TD-MPC2 on DMC and Meta-World while training 1.59× faster, and significantly outperforming baselines on the small-target benchmark DMC-Subtle.
- R2PS: Worst-Case Robust Real-Time Pursuit Strategies under Partial Observability
-
This paper addresses Pursuit-Evasion Games (PEG) on graphs in the most challenging scenario where the evader predicts pursuer moves (asynchronous movement) and the pursuer has only partial observability. By extending optimal strategies from dynamic programming (DP) through a belief maintenance mechanism and cross-graph adversarial reinforcement learning, a GNN-based pursuit strategy is trained. It achieves real-time (sub-second) zero-shot generalization to unseen real city maps, significantly outperforming PSRO baselines trained directly on test graphs in worst-case scenarios.
- R4: Nested Reasoning-Retrieval for Reward Modeling in Role-Playing Agents
-
R4 enables both the "Reward Model" and the "Role-playing Agent" to possess simultaneous reasoning + retrieval capabilities. The reward model rewrites the evaluation process into a structured reasoning chain with retrieval. Utilizing preference signals from this model, the dialogue agent is trained via GRPO, improving the character consistency of the 32B model on CharacterEval from 55.28 to 64.64, ranking first with a 68.2% win rate in human blind tests.
- RD-HRL: Generating Reliable Sub-Goals for Long-Horizon Sparse-Reward Tasks
-
Addressing the issue where high-level policies in offline hierarchical RL choose incorrect sub-goals due to value functions with generalization noise, this paper proposes RD-HRL. It extracts "transition regions" that connect multiple trajectories from offline data as a reliable decision space. A TI module then selects decision-level targets from these regions for the high-level policy, decoupling sub-goal selection from cross-trajectory value estimation. It achieves top-3% performance on 8 out of 9 long-horizon sparse-reward benchmarks including antmaze, Kitchen, and CALVIN.
- REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
-
The REA-RL framework is proposed to identify and truncate overthinking tokens online via a distilled small reflection model, generating revised paths. Combined with a reflection reward to prevent model degradation into naive Chain-of-Thought (CoT) during RL training, it achieves a 36% reduction in inference token overhead with zero accuracy loss on DeepSeek-R1-Distill-Qwen-7B.
- Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment
-
Systematic experiments reveal the underlying mechanism behind the generalization capability of RL-trained reasoning IQA models—the reasoning process essentially converts redundant visual representations into compact, cross-domain aligned textual representations. Based on this, the RALI algorithm is proposed, which directly aligns images with these textual representations via contrastive learning, achieving comparable generalization performance with less than 5% of the parameters and inference time.
- Reasoning Boosts Opinion Alignment in LLMs
-
LLMs are trained via GRPO reinforcement learning to align with individual political opinions through structured reasoning. SFT+GRPO consistently outperforms ICL and ORPO baselines on datasets from the US, Germany, and Switzerland, while systematically revealing fundamental difficulties in predicting neutral stances and right-wing biases.
- Recurrent Action Transformer with Memory
-
RATE (Recurrent Action Transformer with Memory) partitions trajectories into fixed-length segments and uses a set of learnable memory embeddings to pass historical information across segments. It introduces a cross-attention-based "Memory Retention Valve" (MRV) to control whether to retain or overwrite memories. This approach significantly outperforms Decision Transformer on memory-intensive offline RL tasks such as ViZDoom, T-Maze, Memory Maze, and POPGym, while remaining competitive on standard Atari/MuJoCo benchmarks.
- Reevaluating Policy Gradient Methods for Imperfect-Information Games
-
The authors propose the "Policy Gradient Hypothesis" – given proper hyperparameter tuning, general policy gradient methods such as PPO and PPG are not inferior (and often superior) to specialized game-theoretic algorithms based on Fictitious Play, Double Oracle, or Counterfactual Regret Minimization in two-player zero-sum imperfect-information games. To verify this, they open-sourced tools for calculating exact exploitability in five large games for the first time and conducted the largest-scale comparative experiment to date (7000+ runs), with results overwhelmingly supporting the hypothesis.
- Reference Grounded Skill Discovery
-
RGSD utilizes reference motion data to first "ground" the latent skill space onto a semantically meaningful unit hypersphere (via contrastive pre-training). It then performs simultaneous imitation and exploration within this structured space, successfully scaling unsupervised skill discovery to a 69-DoF SMPL humanoid. This enables the high-fidelity reproduction of walking, running, sidestepping, and punching, while discovering new style-consistent variants.
- References Improve LLM Alignment in Non-Verifiable Domains
-
This paper proposes RefEval, a reference-guided LLM-as-Judge method that uses high-quality reference outputs as "soft verifiers," improving LLM-judge accuracy by 6.8%. It further constructs a two-stage self-improvement pipeline (SFT distillation + reference-guided DPO) that outperforms SFT distillation by +19.2/+16.5 on AlpacaEval/Arena-Hard respectively, matching the performance of the fine-tuned reward model ArmoRM. This demonstrates efficient LLM alignment in non-verifiable domains without the need for human preference annotations.
- Refining Hybrid Genetic Search for CVRP via Reinforcement Learning-Finetuned LLM
-
This paper proposes RFTHGS, which fine-tunes a 14B small model using reinforcement learning to automatically generate crossover operators for the Hybrid Genetic Search (HGS) solver. The operators generated for CVRP outperform those manually designed by human experts and generalize stably to instances with up to 1000 nodes, surpassing trillion-parameter commercial models such as GPT-4o, o3, and o4-mini.
- ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation
-
ReFORM is proposed to manipulate the source distribution of a behavior-cloning (BC) flow policy by learning a reflected flow noise generator. This achieves support constraints in a constructive manner, avoiding OOD (Out-Of-Distribution) issues while maintaining policy expressivity without hyperparameter tuning.
- Regret-Guided Search Control for Efficient Learning in AlphaZero
-
The Regret-Guided Search Control (RGSC) framework is proposed to identify high-regret states by training a regret network and prioritize restarting self-play from these states. This simulates the human learning method of "repeatedly reviewing mistakes," outperforming AlphaZero by an average of 77 Elo in 9×9 Go, 10×10 Othello, and 11×11 Hex.
- Reinforcement Learning for Machine Learning Engineering Agents
-
This paper identifies that in Machine Learning Engineering (MLE) tasks with reliable verifiers, updating the parameters of a small model (Qwen2.5-3B) via RL is more effective than repeatedly prompting a frozen large model. Given sufficient compute, the RL-adapted small model outperforms Claude-3.5-Sonnet driven by a SOTA scaffold (AIDE) by an average of 22% across 12 Kaggle tasks. To achieve this, the authors address two pain points in asynchronous RL: using "duration-aware gradients" to correct fast-action bias and "environment instrumentation" to convert sparse rewards into verifiable partial rewards.
- Reinforcement Learning via Value Gradient Flow
-
This paper proposes Value Gradient Flow (VGF), which reformulates "behavior-regularized RL" as an optimal transport problem from a reference distribution to a value-induced optimal distribution. By using particle gradient flow to transport initial actions along value gradients step-by-step, the method achieves implicit control over deviation through a "transport budget" without explicit policy parameterization or regularization terms. It achieves SOTA performance on D4RL, OGBench, and RLHF.
- Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
-
Addressing the debate over whether RLVR truly improves reasoning capabilities or merely enhances sampling efficiency, this paper proposes a new metric, CoT-Pass@K (requiring both correct answers and correct reasoning). Using a theoretical framework for GRPO, it proves that as long as the base model possesses a "logic prior" where correct CoT more likely leads to correct answers, binary rewards based solely on answer correctness will implicitly drive up the probability of generating correct reasoning, thereby authentically extending the reasoning boundary of base models.
- Reinforcement Mid-Training
-
This paper introduces Reinforcement Mid-Training (RMT) to fill the gap between pre-training and post-training. It utilizes unannotated pre-training corpora with next-token prediction as a verifiable reward for RL. The RMT framework employs dynamic token budgets, curriculum-based difficulty sampling, and a dual objective of "Selective RL + Full NTP," achieving up to +64.91% improvement in language modeling over the SOTA RPT while requiring only 21% of the inference length.
- Relative Entropy Pathwise Policy Optimization
-
REPPO transitions "pathwise policy gradients" (updating policies via Q-function derivatives), typically reliant on large replay buffers in off-policy settings, into a purely on-policy framework. By learning a sufficiently accurate Q-function using only current policy trajectories, combined with maximum entropy exploration and auto-tuned KL constraints, it outperforms PPO with higher sample efficiency and lower memory usage, while matching the performance of off-policy FastTD3 on GPU-parallelized benchmarks.
- Relative Value Learning
-
Addressing the observation that "control only cares about value differences while the absolute value scale is a redundant degree of freedom," this paper proposes Relative Value Learning (RV). The critic directly learns an antisymmetric function \(\Delta_\theta(s_i,s_j)=V^\pi(s_i)-V^\pi(s_j)\) supported by a Pairwise Bellman Operator (proven to be a \(\gamma\)-contraction with its unique fixed point equal to the true value difference). The method includes well-defined 1-step / n-step / \(\lambda\)-return targets and an unbiased advantage estimator, R-GAE, reconstructed from pairwise differences. Integrated with PPO, it performs comparably to or better than standard PPO across 49 Atari games.
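The fixed-point claim can be checked by subtracting two copies of the standard Bellman equation. The derivation below is a sketch under assumed notation, not taken from the paper: \(r_i\) and \(s_i'\) denote the reward and next state obtained from \(s_i\) under \(\pi\), and \(\mathcal{T}\) denotes the pairwise operator.

\[
V^\pi(s_i) - V^\pi(s_j)
= \mathbb{E}\big[r_i + \gamma V^\pi(s_i')\big] - \mathbb{E}\big[r_j + \gamma V^\pi(s_j')\big]
\]

\[
(\mathcal{T}\Delta)(s_i,s_j) = \mathbb{E}\big[r_i - r_j + \gamma\,\Delta(s_i',s_j')\big]
\]

\[
\|\mathcal{T}\Delta_1 - \mathcal{T}\Delta_2\|_\infty
= \gamma \sup_{s_i,s_j}\Big|\mathbb{E}\big[\Delta_1(s_i',s_j') - \Delta_2(s_i',s_j')\big]\Big|
\le \gamma\,\|\Delta_1 - \Delta_2\|_\infty
\]

The first line shows that the true value difference satisfies \(\Delta = \mathcal{T}\Delta\); the last line is the \(\gamma\)-contraction, which makes that fixed point unique among bounded functions.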
- Reliability-Adjusted Prioritized Experience Replay
-
The authors argue that using absolute Temporal Difference Error (TDE) as sampling weights in PER can "mislead learning" if the target Q-values themselves are inaccurate. They propose a "reliability score" \(R_t\) based on the sum of subsequent TDEs within a trajectory, modifying the sampling weights to "Reliability × Absolute TDE." Theoretically, the convergence error is proven to be strictly superior to PER, and empirically, it consistently outperforms PER in classic control and Atari-10 (with a 22.97% higher median peak score in Atari-10).
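The reweighting can be sketched as below. The specific reliability formula is an assumption for illustration: a transition counts as reliable when little absolute TD error remains in the later steps of its trajectory, so \(R_t = 1 - \sum_{i>t}|\delta_i| \,/\, \sum_i |\delta_i|\). The function name is hypothetical.

```python
def reliability_adjusted_priorities(td_errors):
    """Per-step priority = reliability * |TD error| for one trajectory."""
    abs_td = [abs(d) for d in td_errors]
    total = sum(abs_td)
    if total == 0.0:
        return [0.0] * len(abs_td)
    priorities = []
    downstream = total
    for err in abs_td:
        downstream -= err                       # |TD error| left after this step
        reliability = 1.0 - downstream / total  # 1.0 at the final step
        priorities.append(reliability * err)
    return priorities

priorities = reliability_adjusted_priorities([1.0, -2.0, 1.0])
```

The first transition has the same absolute TD error as the last one but a quarter of its priority, because its bootstrap target depends on later estimates that are still inaccurate.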
- Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models
-
LLM layer pruning is modeled as a cooperative game (each layer = player, model performance = utility). Since exact Shapley value calculation is infeasible (\(2^L\) combinations), a two-stage approximation is proposed: (1) stratified Monte Carlo sampling to generate masks and evaluate PPL as supervision signals; (2) training a lightweight surrogate network to predict performance for arbitrary masks. This allows efficient estimation of each layer's Shapley value while capturing inter-layer dependencies, significantly outperforming static heuristic pruning baselines.
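The Monte Carlo idea can be sketched with plain permutation sampling. This is a simplified stand-in, not the paper's method: it is neither stratified nor surrogate-assisted, and `toy_utility` is a hypothetical additive utility used only so the result is easy to check. In a real setting the utility would be a model-quality score such as negative perplexity for the kept layers.

```python
import random

def shapley_monte_carlo(num_layers, utility, num_permutations, rng):
    """Estimate each layer's Shapley value from random layer orderings."""
    values = [0.0] * num_layers
    for _ in range(num_permutations):
        order = list(range(num_layers))
        rng.shuffle(order)
        kept = set()
        previous = utility(frozenset(kept))
        for layer in order:
            kept.add(layer)
            current = utility(frozenset(kept))
            values[layer] += current - previous   # marginal contribution
            previous = current
    return [v / num_permutations for v in values]

weights = [0.5, 0.3, 0.2]

def toy_utility(kept_layers):
    # Additive toy utility: each layer contributes a fixed amount.
    return sum(weights[i] for i in kept_layers)

estimates = shapley_monte_carlo(3, toy_utility, num_permutations=50,
                                rng=random.Random(0))
```

For an additive utility the estimates match the weights; with a real model, marginal contributions depend on which other layers are present, which is the inter-layer dependency the Shapley value captures.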
- Replicable Reinforcement Learning with Linear Function Approximation
-
This paper provides the first provably replicable reinforcement learning algorithm beyond tabular settings, targeting linear MDPs. By constructing two foundational tools (replicable ridge regression and uncentered covariance estimation) and integrating them into the LSVI / LSVI-UCB frameworks, the algorithm outputs bit-wise identical policies with high probability when run on two independent datasets. Empirical results on CartPole and Atari confirm that quantization ideas lead to more consistent neural policies.
- Representation-Based Exploration for Language Models: From Test-Time to Post-Training
-
This paper proposes RepExp, an "elliptical diversity bonus" constructed from a pretrained language model's own hidden states to explicitly incentivize exploration. It is first validated on a clean "test-time selection" testbed, then integrated into GRPO post-training. Results demonstrate a 50%+ improvement in verifier efficiency at test-time and the complete elimination of the common RL phenomenon where "pass@k collapses at large k" during post-training.
- RESCHED: Rethinking Flexible Job Shop Scheduling from a Transformer-based Architecture with Simplified States
-
RESCHED reduces the state of Flexible Job Shop Scheduling (FJSP) from "20+ manual features + historical dependency" to just 4 core features. It pairs this with a dual-branch Transformer tailored for scheduling (using RoPE for operation ordering, embedding processing time as edge features into attention values, and employing self-connections to mitigate imbalances in operation/machine counts). Using only basic REINFORCE for training, it outperforms all scheduling rules and SOTA Graph Neural Network methods on FJSP, while generalizing to JSSP and FFSP variants with zero modifications.
- ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
-
ResT targets RL training for tool-use LLMs. It theoretically proves that "low-entropy structured tokens (tool names, parameters, format tags) are the primary determinants of rewards, and reducing average entropy minimizes policy gradient variance." Based on this, it proposes inverse reweighting of token-level policy gradients by regional average entropy and employs curriculum annealing to transition weights from "format correctness" to "semantic reasoning." It achieves up to an 8.76% improvement over GRPO on BFCL/API-Bank, with the 4B model outperforming GPT-4o by 1.50% on multi-turn base tasks.
- Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent RL
-
This paper proposes S2Q (Successive Sub-value Q-learning), which explicitly retains suboptimal joint actions by progressively learning \(K\) sub-value functions. Combined with a Softmax behavior policy for prioritized sampling among candidates, it addresses the fundamental issue in cooperative MARL where value decomposition methods converge to suboptimal policies due to dynamic shifts in the optimal point.
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
-
ReTool employs a training framework of "cold-start SFT + tool-augmented RL" to enable LLMs to autonomously learn "when and how to call a code interpreter" during long-chain reasoning. By using only outcome-based rewards, a 32B model achieved 67.0% on AIME2024, significantly surpassing the text-only RL baseline (40.0%) while using only one-third of the training steps.
- Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training
-
This paper generalizes DeepSeek's GRPO from on-policy to off-policy by utilizing a lagged policy \(\alpha=\pi_{k-v}\) to whiten rewards and estimate advantages. It proves that both on-policy and off-policy objectives provide a lower bound for expected reward improvement, leading to a clipped surrogate objective consistent with off-policy PPO. Experiments demonstrate that off-policy GRPO (updating the inference server every \(v\) steps) performs as well as or better than its on-policy counterpart in mathematical reasoning tasks while increasing training throughput for a 7B model by approximately 1.35×.
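The two ingredients, group-whitened advantages and a clipped surrogate, can be sketched as below. This is a generic illustration, not the paper's code: the rewards are assumed to come from completions sampled by the lagged policy, `ratio` stands for the current-to-lagged policy probability ratio, and both function names are hypothetical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Whiten rewards within one prompt's group of sampled completions."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, clip=0.2):
    """PPO-style pessimistic objective for a single sample."""
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
capped_gain = clipped_surrogate(ratio=1.5, advantage=1.0)
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)
```

The larger the lag \(v\), the further the ratios drift from 1, so the clip range does more of the work in keeping updates inside the trust region.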
- Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching
-
This paper reveals the fundamental flaw of existing matrix-sketching-based linear bandit methods, which degrade to linear regret when the spectral tail of the streaming data is heavy. It proposes the Dyadic Block Sketching framework, a multi-scale sketching approach that controls the global approximation error to a preset parameter \(\epsilon\) by dynamically doubling sketch sizes. This ensures sublinear regret without prior knowledge of the spectral properties of the stream matrix and adaptively recovers the computational efficiency of single-scale methods in spectral-friendly scenarios.
- Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
-
This paper proposes TraceRL—a trace-aware reinforcement learning framework that incorporates the decoding trace of Diffusion Large Language Models (DLMs) during inference into the post-training objective. It features a variance-reducing diffusion value model that uniformly adapts to both full-attention and block-attention DLMs. Based on this, the TraDo series of SOTA diffusion language models are trained, outperforming autoregressive models of the same or even larger sizes in math and code reasoning.
- Reward is Enough: LLMs are In-Context Reinforcement Learners
-
This paper discovers that Reinforcement Learning behaviors emerge in LLMs during the inference phase (In-Context RL, ICRL). By concatenating past responses and corresponding scalar rewards into the context through multi-turn prompting, the model's response quality monotonically improves with context growth. It significantly outperforms Self-Refine and Reflexion on Game of 24, Creative Writing, ScienceWorld, and AIME/HMMT, remaining effective even when rewards are generated through self-evaluation.
- RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
-
This paper proposes the RewardMap framework to overcome sparse rewards in fine-grained visual reasoning through difficulty-aware detailed reward design and a multi-stage RL curriculum strategy transitioning from simple perception to complex reasoning.
- Risk-Sensitive Reinforcement Learning for Alleviating Exploration Dilemmas in Large Language Models
-
To address the "exploration dilemma" in Reinforcement Learning with Verifiable Rewards (RLVR), where pre-trained LLMs only strengthen existing sparse solutions leading to stagnant or declining diversity (pass@k), this paper constructs a risk-seeking objective using exponential utility that smoothly interpolates between "mean reward" and "max reward." This derives the RS-GRPO algorithm, which requires only modifying the advantage function, effectively improving pass@k while maintaining or enhancing pass@1 across 6 mathematical reasoning benchmarks and 5-6 LLMs.
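To make the interpolation between "mean reward" and "max reward" concrete, here is a small sketch of the standard exponential-utility (entropic) objective \(\frac{1}{\beta}\log \mathbb{E}[e^{\beta r}]\). The function name and the use of an empirical mean over a sampled group are assumptions for illustration; the RS-GRPO advantage itself is not reproduced here.

```python
import math
from typing import Sequence

def exp_utility(rewards: Sequence[float], beta: float) -> float:
    """(1/beta) * log(mean(exp(beta * r))), computed in a numerically stable way."""
    if beta == 0.0:
        # The beta -> 0 limit is the plain mean reward.
        return sum(rewards) / len(rewards)
    m = max(beta * r for r in rewards)  # shift exponents to avoid overflow
    log_mean_exp = m + math.log(
        sum(math.exp(beta * r - m) for r in rewards) / len(rewards)
    )
    return log_mean_exp / beta
```

With binary rewards where one of four rollouts succeeds, a small positive `beta` gives a value near the mean (0.25), while a large `beta` approaches the max (1.0).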
- RiskPO: Risk-based Policy Optimization with Verifiable Reward for LLM Post-Training
-
Addressing the issues of early entropy collapse and reasoning boundary stagnation in RLVR methods like GRPO that "optimize mean reward," this paper proposes RiskPO. It replaces the mean objective with a Mixed Value-at-Risk (MVaR) objective, focusing gradient signals on the left tail of the reward distribution (hard problems). Combined with problem bundling to transform binary feedback into a continuous distribution, RiskPO consistently outperforms GRPO and its variants in Pass@1 and Pass@k across mathematical, multimodal, and code reasoning tasks.
- RL for Reasoning by Adaptively Revealing Rationales
-
This paper proposes AdaBack (Adaptive Backtrack): a method that dynamically reveals a prefix of the target reasoning chain as a prompt during RL training and performs a stochastic binary search on the "reveal ratio" based on reward feedback. This allows the model to transition from "completing the last step" to "generating the full chain from scratch," enabling the learning of novel reasoning capabilities on sparse reward tasks where both SFT and standard RL fail.
- RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?
-
The authors developed a controlled synthetic programming benchmark, DELTA, demonstrating that on "hard problem families" where base models fail to sample any correct solution (\(pass@K=0\)), a staged RL recipe—initial dense per-test reward warmup followed by a switch to binary full-pass reward—enables models to undergo a grokking phase transition after a near-zero reward plateau, jumping to near-perfect scores. This process unlocks entirely new algorithmic strategies unavailable to the base model and systematically characterizes generalization boundaries along three axes: exploration, composition, and transformation.
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
-
Transcending the "accuracy-only" perspective, this paper proposes an analysis framework to quantify reasoning processes at both trajectory-level and step-level (reasoning graph) granularities. By systematically comparing the distinct shaping effects of RL and SFT on reasoning LLMs, it concludes that RL "squeezes" while SFT "expands" the reasoning space, providing a mechanistic explanation for why the "SFT followed by RL" two-stage training paradigm is effective.
- RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
-
RLAC reformulates free-form generation post-training, which requires satisfying numerous implicit rubrics, as a minimax game between a generator and a learnable Critic. Instead of enumerating all rubrics, the Critic selects the single most likely-to-fail rubric for external verification. This approach outperforms both exhaustive verification and reward models in biographical factuality and code generation, while reducing verification calls by up to 5.7×.
- RLP: Reinforcement as a Pretraining Objective
-
This paper proposes RLP (Reinforcement Learning Pretraining), an information gain-driven RL pretraining objective. By rewarding Chain-of-Thought (CoT) that increases the probability of next-token prediction, it shifts RL from the post-training stage to the pretraining stage, achieving dense reward signals without a verifier.
- RL's Razor: Why Online Reinforcement Learning Forgets Less
-
This paper discovers that the KL divergence between the base model and the fine-tuned model on the new task distribution can predict catastrophic forgetting. It explains why on-policy RL, compared to SFT, tends to find high-reward solutions closer to the original policy, thereby forgetting less when learning new tasks.
- RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
-
RLVER treats a "sentient user simulator" with self-consistent emotion updates as an RL environment, using the emotion scores provided by the simulated user at the end of multi-turn dialogues as verifiable rewards to train LLMs end-to-end for empathy. This approach allows Qwen2.5-7B-Instruct to improve from 13.3 to 79.2 on the Sentient Benchmark, approaching top-tier closed-source models with almost no loss in math or coding abilities.
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
-
To address the problem where end-to-end RL, which "only rewards final success," reinforces redundant and deviated reasoning paths, RLVMR enables agents to explicitly label cognitive steps using four tags: <planning>/<explore>/<reflection>/<monitor>. It issues verifiable dense rewards for these meta-reasoning behaviors via programmatic rules, optimized with GRPO-MR using dual-level advantages. A 7B model achieved an 83.6% success rate on the most difficult unseen task split (L2) of ALFWorld while significantly reducing invalid and repeated actions.
- RM-R1: Reward Modeling as Reasoning
-
This work redefines reward modeling as a reasoning task and proposes the RM-R1 series of Reasoning Reward Models (ReasRM). Through reasoning distillation, RL training, and the Chain-of-Rubrics (CoR) mechanism, it outperforms 70B and GPT-4o models on three major reward model benchmarks by an average of 4.9%.
- RAMPS: Robust Adaptive Multi-step Predictive Shield
-
RAMPS employs a globally learned linear dynamics model (linear regression or deep Koopman operator) paired with a robust multi-step Control Barrier Function (CBF) shield. It scales formal shielding techniques—previously limited to systems with a dozen dimensions—to a 348-dimensional legged locomotion task, reducing safety violations by up to 90% during training while maintaining competitive task rewards.
- Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation
-
This paper investigates a new type of threat in RL—behavior-targeted attacks (where an adversary guides the victim to execute a specific target policy by tampering with observations). It proposes the BIA attack method, which does not require white-box access, and the TDRT defense method based on time discounting. TDRT maintains robustness against attacks while achieving 28.2% higher original task performance than existing defenses (SA-PPO).
- Robust Multi-Objective Controlled Decoding of Large Language Models
-
This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically calculates worst-case objective weights by solving the Nash equilibrium of a minimax game, achieving robust multi-objective alignment for LLMs without prior weight information.
- Robustness in the Face of Partial Identifiability in Reward Learning
-
This paper reformulates "partial identifiability" in reward learning from a qualitative risk into a measurable worst-case loss. It proposes Rob-ReL to output robust predictions and error certificates using a minimax approach in preference evaluation tasks.
- Routing, Cascades, and User Choice for LLMs
-
LLM routing is modeled as a provider-user Stackelberg game. It is proved that optimal routing is almost always a static threshold rule without cascading. The study reveals systematic user-provider misalignment when quality/cost rankings are inconsistent, and shows that under low churn penalties, providers are incentivized to reduce costs through throttling latency, which harms user utility.
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
-
This paper introduces Rubrics as Rewards (RaR), treating itemized rubric checklists as reward functions for on-policy reinforcement learning. This extends RL with Verifiable Rewards (RLVR)—previously limited to "verifiable" tasks like math or code—to real-world reasoning domains like medicine and science where no single standard answer exists. RaR achieves up to a 31% improvement over popular LLM-as-judge Likert baselines on HealthBench and a 7% improvement on GPQA-Diamond.
- RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling
-
RuleReasoner constructs a diverse rule reasoning dataset, RuleCollection-32K, and proposes a Domain-aware Dynamic Sampling (Dads) strategy. By training 8B models under the RLVR framework, it outperforms OpenAI-o1 by 4.1% on in-domain reasoning tasks and 10.4% on out-of-domain tasks, while improving training efficiency by ~1.4×.
- SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
-
SAC Flow treats the multi-step sampling process of flow-based policies as a residual RNN. By utilizing GRU/Transformer-style velocity networks and noise-augmented rollouts, it enables stable end-to-end training of high-expressivity flow policies via SAC, achieving superior sample efficiency in continuous control and offline-to-online manipulation tasks.
- Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form
-
This work proposes the first continuous-time multi-agent RL framework that explicitly handles state constraints. By employing an epigraph form, discontinuous constraint value functions are transformed into continuous representations. Combined with an improved PINN actor-critic method, the framework achieves safe and stable continuous-time multi-agent control.
- Safe Exploration via Policy Priors
-
This paper proposes SOOPER, a model-based safe exploration algorithm that uses a "suboptimal but conservative" prior policy as a safety guardrail. During online interaction, the agent pessimistically falls back to this prior to ensure safety; during simulation, it explores the world model aggressively and optimistically. By reframing the constrained task into an unconstrained "terminating planning MDP," in which trajectories terminate upon fallback, the method achieves sublinear cumulative regret while maintaining safety throughout the learning process, and it is validated on real racing hardware.
- SafeMPO: Constrained Reinforcement Learning via Probabilistic Incremental Improvement
-
SafeMPO models "safety" as an inferrable probabilistic event, shifting constrained reinforcement learning from "hard-projecting the policy into the feasible region" to "guaranteeing each step is safer than the last." By leveraging the EM framework of MPO and the log-barrier construction from interior point methods, it formulates a non-parametric proxy problem with geometric convergence guarantees. With only one hyperparameter that does not affect asymptotic behavior, its performance is competitive with or superior to highly tuned constrained RL baselines.
- Sample-efficient and Scalable Exploration in Continuous-Time RL
-
The authors propose the COMBRL algorithm, which achieves scalable and sample-efficient exploration in continuous-time model-based RL with sublinear regret guarantees by maximizing the weighted sum of extrinsic rewards and model epistemic uncertainty.
- Sample Efficient Offline RL via T-Symmetry Enforced Latent State-Stitching
-
TELS performs the entire policy optimization of offline RL within a compact latent space constrained by "time-reversal symmetry" (T-symmetry) for state stitching. By learning out-of-distribution (OOD) friendly latent representations through a T-symmetry enforced inverse dynamics model (TS-IDM), it completely bypasses traditional action-level conservative constraints. It significantly outperforms methods like TSRL, POR, and IQL on small-sample D4RL tasks (0.5%–10% data) and real-world industrial control environments.
- Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
-
To address the "length inflation" problem where reasoning chains become excessively long after RLVR (GRPO) training, this paper proposes GFPO. It samples a larger set of candidates during training and calculates policy gradients using only the top-k responses filtered by length or token efficiency. By trading "more sampling during training" for "less thinking during inference," GFPO reduces length inflation in Phi-4-reasoning by up to 85% without compromising accuracy.
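The filtering step described above can be sketched roughly as follows. This is an illustrative reading of the summary, not the authors' code: it assumes filtering by raw length, normalizes rewards within the retained subset, and gives rejected responses zero advantage so they contribute no gradient.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def gfpo_advantages(rewards: Sequence[float],
                    lengths: Sequence[int],
                    k: int) -> List[float]:
    """Keep the k shortest responses; normalize rewards within that subset."""
    order = sorted(range(len(rewards)), key=lambda i: lengths[i])
    kept = order[:k]
    kept_rewards = [rewards[i] for i in kept]
    mu = mean(kept_rewards)
    sigma = pstdev(kept_rewards) or 1.0  # fall back to 1.0 if all rewards match
    adv = [0.0] * len(rewards)  # assumption: filtered-out responses get no signal
    for i in kept:
        adv[i] = (rewards[i] - mu) / sigma
    return adv
```

Swapping the sort key for a reward-per-token ratio would give a token-efficiency variant of the same filter.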
- Scalable In-Context Q-Learning
-
This paper proposes S-ICQL, which integrates dynamic programming (Q-learning) and world models into the supervised ICRL framework. It employs a multi-head Transformer to simultaneously predict policies and contextual value functions, uses a pre-trained world model to construct lightweight and accurate prompts, and extracts the policy with advantage-weighted regression. S-ICQL consistently outperforms all baselines when learning from suboptimal data in both discrete and continuous environments.
- Scalable Offline Model-Based RL with Action Chunks
-
MAC utilizes action chunk models to compress multiple single-step model calls in long-horizon offline MBRL into fewer multi-step predictions. By employing rejection sampling from a flow-based behavioral policy to select conservative and high-value action chunks, it significantly outperforms existing offline MBRL methods on 100M-scale OGBench long-horizon manipulation tasks.
- Multistep Quasimetric Learning for Scalable Goal-Conditioned Reinforcement Learning
-
This paper proposes MQE (Multistep Quasimetric Estimation), which integrates multistep Monte Carlo returns into a quasimetric distance architecture. It learns a goal-conditioned Q-function that satisfies the triangle inequality end-to-end. This enables non-hierarchical, planner-free "stitching" and compositional generalization for the first time in offline GCRL tasks spanning up to 4000 steps and real-world multi-stage robotic arm manipulation.
- Scheduling Your LLM Reinforcement Learning with Reasoning Trees
-
This paper proposes using "reasoning tree structure" rather than "answer accuracy" to measure the true learning difficulty of a problem for LLMs. It defines a new metric, Reasoning Score (r-score), and designs "Re-Schedule," a curriculum-based data scheduling algorithm, which improves average accuracy on six mathematical reasoning benchmarks by up to 3.2%.
- SCRIBES: Web-scale Scripted Semi-structured Data Extraction with Reinforcement Learning
-
Instead of letting LLMs parse webpages page-by-page, SCRIBES uses Reinforcement Learning to train a model that generates a reusable extraction script (BeautifulSoup code) after seeing a single webpage. By leveraging the property that "webpages from the same site share similar layouts," it designs cross-page rewards to ensure scripts generalize to entire groups of structurally similar pages. Script quality outperforms strong agentic baselines by 13%+, downstream QA performance improves by 4%+ on GPT-4o, and extraction costs decrease linearly with the number of similar pages.
- Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
-
Addressing the issues of ineffective exploration and entropy collapse in RLVR training for weak models, this paper proposes MENTOR. It injects expert distributions for mixed-policy sampling only at "critical decision points" (high-entropy tokens) and utilizes a Mixed-policy GRPO with asymmetric advantages. This allows the model to absorb the essence of expert reasoning rather than superficial imitation, consistently improving base model scores by 3–4 points and increasing pass@32 by an average of 9.2% across six mathematical benchmarks.
- Self-Aligned Reward: Towards Effective and Efficient Reasoners
-
Addressing the coarse-grained limitations of verifiable rewards—which "only check answer correctness and tolerate excessive verbosity"—this paper proposes Self-Aligned Reward (SAR). SAR utilizes the "relative perplexity difference of an answer under conditioned versus unconditioned query scenarios" as a model self-critique signal. When added to the verifiable rewards of PPO/GRPO, it improves accuracy by approximately 4% and compresses answer length by about 30% across 4 models and 7 benchmarks.
- Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning
-
This paper proposes the Self-Harmony framework, in which a single model plays two roles: a Solver that solves the original problem and a Reframer that restates it. The harmonic mean of an answer's scores under the original and reframed views serves as the pseudo-label selection criterion, replacing traditional majority voting. It achieves SOTA in 28 out of 30 experimental settings with zero training failures.
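A small sketch of how a harmonic-mean selection rule differs from majority voting. It assumes the per-view "score" of an answer is simply its sampling frequency in that view, which is an illustrative choice and may not match the paper's exact definition.

```python
from collections import Counter
from typing import Sequence

def harmonic_pseudo_label(original: Sequence[str],
                          reframed: Sequence[str]) -> str:
    """Pick the answer with the highest harmonic mean of its two frequencies."""
    p = Counter(original)
    q = Counter(reframed)

    def score(ans: str) -> float:
        a = p[ans] / len(original)   # frequency under the original question
        b = q[ans] / len(reframed)   # frequency under the reframed question
        return 0.0 if a + b == 0 else 2 * a * b / (a + b)

    # Sorting makes tie-breaking deterministic.
    return max(sorted(set(original) | set(reframed)), key=score)
```

Because the harmonic mean is dominated by the smaller of the two frequencies, an answer that wins the majority vote in one view but never appears in the other scores zero.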
- Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning
-
This paper proposes SISL (Self-Improving Skill Learning), which achieves robust skill learning under noisy offline demonstration data by decoupling high-level policies from skill improvement policies and incorporating a skill prioritization mechanism based on maximum return relabeling. This significantly enhances the performance of skill-based meta-reinforcement learning in long-horizon tasks.
- SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration
-
SHAPO applies Sharpness-Aware Minimization (SAM) to policy updates: instead of taking the gradient at the current parameters \(\theta_0\), it first identifies a nearby parameter \(\theta_0+\epsilon_{\text{Down}}\) under Fisher/KL geometry that "worsens" the objective, and then uses the gradient from that point to update the policy. This maintains pessimism regarding the actor's epistemic uncertainty, simultaneously improving safety and returns across multiple continuous control tasks and significantly broadening the safety-efficiency Pareto frontier.
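The two-step update described above can be illustrated with a toy sketch. As a loud simplification, it uses a Euclidean ball of radius `rho` in place of the Fisher/KL geometry, and `grad_fn`, `rho`, and `lr` are hypothetical names that do not come from the paper.

```python
import math
from typing import Callable, List

Vector = List[float]

def sam_style_ascent_step(theta: Vector,
                          grad_fn: Callable[[Vector], Vector],
                          rho: float = 0.05,
                          lr: float = 0.1) -> Vector:
    """One update that uses the gradient at a pessimistically perturbed point."""
    g = grad_fn(theta)
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    # Step against the gradient: to first order, this is the direction inside
    # a Euclidean ball of radius rho that lowers the objective the most.
    perturbed = [t - rho * x / norm for t, x in zip(theta, g)]
    g_perturbed = grad_fn(perturbed)
    # Gradient ascent on the objective, using the perturbed-point gradient.
    return [t + lr * x for t, x in zip(theta, g_perturbed)]
```

For the toy objective \(J(\theta) = -(\theta-3)^2\), the perturbed point lies slightly farther from the optimum than \(\theta_0\), so the update uses the gradient seen from that worse nearby point.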
- Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
-
The authors propose Shop-R1, a framework that utilizes a hierarchical reward mechanism and difficulty-aware scaling in reinforcement learning to significantly enhance the ability of LLMs to simulate real human online shopping behavior. Compared to SFT baselines, it achieves an improvement of over 65% in exact action matching.
- Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
-
This paper introduces Simplicial Embeddings (SEM) as a lightweight geometric constraint for representation layers in actor-critic networks. By mapping hidden features of the actor and critic into a product space of multiple simplices, it mitigates representation collapse caused by non-stationary bootstrapping. The method improves sample efficiency across FastTD3, FastSAC, PPO, and various robotic and Atari environments.
- Single-stream Policy Optimization
-
SPO (Single-stream Policy Optimization) completely abandons the "collect a group per prompt and calculate relative advantages within the group" approach used in GRPO. It returns to the classic single-stream policy gradient: employing a lightweight KL-adaptive Bayesian value tracker to maintain a persistent success rate baseline for each prompt, performing global advantage normalization across the entire batch, and utilizing this baseline for an adaptive curriculum via priority sampling. On Qwen3-8B, it achieves an average maj@32 improvement of +3.4 pp across five math competition benchmarks compared to GRPO. Simultaneously, its "group-free" design leads to a 4.35× throughput acceleration in variable-length agentic scenarios.
- Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions
-
This paper proposes the Single Index Bandits (SIB) problem, extending generalized linear bandits to settings with unknown reward functions. A family of efficient algorithms (STOR/ESTOR/GSTOR) based on Stein's method is designed, achieving a near-optimal regret bound of \(\tilde{O}(\sqrt{T})\) under monotonically increasing reward functions.
- Skill Learning via Policy Diversity Yields Identifiable Representations for Reinforcement Learning
-
This paper employs the identifiability theory of nonlinear ICA to explain why "Mutual Information Skill Learning (MISL)" is effective. Taking Contrastive Successor Features (CSF) as a representative, it proves that as long as skills are sufficiently diverse and the critic is parameterized by an inner product, the learned features can recover the true environment state "up to a linear transformation." This provides the first identifiability guarantee for representation learning in RL and clarifies the advantages/disadvantages of design choices like inner-product parameterization, mutual information formulations, and maximum entropy regularization.
- SocialJax: An Evaluation Suite for Multi-Agent Reinforcement Learning in Sequential Social Dilemmas
-
SocialJax rewrites the "Sequential Social Dilemma" environments from Melting Pot 2.0 using JAX to create a GPU-parallelized evaluation suite. It includes 9 mixed-motive grid worlds and 6 MARL baseline algorithms, speeds up training by at least 50× compared to Melting Pot, and verifies the social dilemma properties of each environment via Schelling diagrams.
- Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information
-
The paper proves the atomic structure of Nash equilibrium strategies in two-player zero-sum differential games with one-sided information—where the informed player P1's equilibrium strategy concentrates on at most \(I\) action prototypes (\(I\) = number of game types). This reduces the game tree complexity from \(U^{2K}\) to \(I^K\), enabling the solution of 11v11 American football in continuous action space (traditional complexity \(10^{440}\)) on an M1 MacBook within 30 minutes.
- Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning
-
This paper presents the first method for solving infinite-horizon discounted General-Utility MDPs (GUMDP) under "single-trial" evaluation. It first proves that history-dependent policies are necessary in this regime and reformulates the problem into a standard MDP that tracks "running occupancy" (occupancy MDP). The problem is then solved incrementally via Monte Carlo Tree Search (MCTS) online planning. The approach significantly outperforms infinite-trial optimal and random policies across three task types: entropy exploration, imitation learning, and adversarial MDPs.
- Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning
-
The authors propose Feasibility-Guided Exploration (FGE) to simultaneously identify feasible parameter subsets and learn safe policies within those subsets. It addresses parameter-robust avoidance problems where feasibility is unknown, achieving over 50% higher coverage than state-of-the-art methods in MuJoCo tasks.
- Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs
-
This paper systematically dissects what Reinforcement Learning from Verifiable Rewards (RLVR) genuinely modifies in models through the lens of token-level distributional shifts. It reveals that RL fine-tuning significantly alters the next-token prediction distribution at only a very small fraction of token positions (~17% in DAPO, less than 2% in SimpleRL). Through "crossover sampling" interventions, it is demonstrated that this small group of tokens determines almost all reasoning performance gains. RLVR acts more like a precise surgery that redistributes probability mass within existing candidate sets rather than providing a global rewrite of the model.
- Spectral Bellman Method: Unifying Representation and Exploration in RL
-
The paper proposes the Spectral Bellman Method (SBM), which starts from the zero Intrinsic Bellman Error (IBE) condition and uncovers a spectral link between the Bellman operator and feature covariance. From this link it derives a novel representation learning objective that naturally unifies representation learning with Thompson Sampling exploration.
- SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
-
The SPELL framework is proposed, where an LLM simultaneously plays three roles—Questioner, Answerer, and Verifier—to perform self-play reinforcement learning. This approach continuously enhances long-context reasoning capabilities without human annotations, achieving consistent performance gains across 6 long-context benchmarks.
- SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models
-
Addressing the issue where the uncomputable log-likelihood of masked diffusion language models (dLLMs) leads to biased RL policy gradients, this paper proposes Sandwiched Policy Gradient (SPG). It maximizes the Evidence Lower Bound (ELBO) for samples with positive advantages and minimizes a newly derived computable Evidence Upper Bound (EUBO) for samples with negative advantages, effectively "sandwiching" the true objective. Combined with block masking estimation, it achieves improvements of 3.6%, 2.6%, 18.4%, and 27.0% over previous SOTA on GSM8K, MATH500, Countdown, and Sudoku, respectively.
- SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
-
The SPIRAL framework is proposed, enabling LLMs to undergo self-play training in multi-turn zero-sum games. By stabilizing training through Role-conditioned Advantage Estimation (RAE), it improves reasoning capabilities by up to 10% without domain-specific data and identifies complementary cognitive skills developed across different games.
- Spotlight on Token Perception for Multimodal Reinforcement Learning
-
This paper proposes Visually-Perceptive Policy Optimization (VPPO), which quantifies the vision dependency of each token to refine learning signals at both the trajectory and token levels, significantly enhancing the multimodal reasoning capabilities of Large Vision-Language Models.
- Squeeze the Soaked Sponge: Efficient Off-Policy RFT for Large Language Model
-
This paper proposes ReMix, which transforms naturally on-policy reinforcement fine-tuning (RFT) methods like PPO/GRPO into mixed-policy algorithms capable of reusing historical rollouts. Utilizing a trio of Mix-PPG, KL-Convex constraints, and policy reincarnation, it achieves SOTA-level accuracy across five mathematical reasoning benchmarks with 30×–450× less rollout data.
- SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
-
SRFT uses entropy as a dynamic indicator to simultaneously apply SFT and RL losses to both demonstration data and self-exploration rollouts in a single stage. This avoids the "learning tax" of the SFT→RL two-stage paradigm and outperforms the zero-RL baseline by an average of 9.0 points across five mathematical reasoning benchmarks.
- SSVPO: Toward Effective Step-level Credit Assignment for Language Model RL Training
-
SSVPO draws inspiration from Shapley Values in multi-agent RL (MARL), treating each step in a reasoning chain as an "agent." Through an Insertion MDP, it rearranges steps into various new chains to measure the marginal contribution of each step (Sequential Shapley Value). This value serves as the advantage baseline for PPO-based policy optimization. It provides fair credit assignment for partially correct chains and identifies zero-contribution steps to shorten reasoning chains. On 7 mathematical reasoning benchmarks, SSVPO outperforms RLOO, GRPO, DAPO, VinePPO, and SPO, achieving up to a +11.6% accuracy gain, -18.1% token usage, and 1.6x inference efficiency.
- Stackelberg Coupling of Online Representation Learning and Reinforcement Learning
-
The SCORER framework is proposed to model representation learning and value function learning in Deep Q-Learning as a Stackelberg game. Through two-time-scale updates (slow update for the Q-network as the leader and fast update for the encoder as the follower), it achieves stable co-adaptation and enhances performance without altering the network architecture.
- STAIRS-Former: Spatio-Temporal Attention with Interleaved Recursive Structure Transformer for Offline Multi-Task Multi-Agent Reinforcement Learning
-
Addressing the issue in offline multi-task multi-agent reinforcement learning (MT-MARL) where existing Transformers underutilize attention and fail to exploit historical information, STAIRS-Former reconstructs the architecture with a "recursive spatial Transformer + dual-time scale history module + token dropout." This refocuses attention on key entities and historical tokens, increasing the average win rate on benchmarks like SMAC / SMAC-v2 from 57.2% (HiSSD) to 67.4%, setting a new SOTA.
- Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
-
This paper proposes ARLCP (Adaptive Reflection and Length Coordinated Penalty), an adaptive reinforcement learning method that dynamically adjusts the weights of reflection and length penalties based on problem complexity. It significantly reduces inference token consumption while maintaining or improving accuracy.
- Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning
-
This paper proposes the SSE (Strict Subgoal Execution) framework, which strictly distinguishes between success and failure in reaching subgoals through Frontier Experience Replay (FER). Combined with a decoupled exploration strategy and failure-aware path optimization, it enforces subgoal completion within each high-level step, significantly reducing the number of high-level decision steps and improving success rates in long-horizon tasks.
- Structured In-context Environment Scaling for Large Language Model Reasoning
-
This paper proposes the Structured In-context Environment (SIE) framework, which automatically constructs scalable, generalizable, and verifiable LLM reasoning environments from large-scale Knowledge Graphs (KGs). By treating supporting subgraphs as soft constraints within prompts and employing GRPO for RL fine-tuning, the method significantly enhances performance on structured reasoning tasks and transfers compositional reasoning capabilities to out-of-distribution tasks such as mathematics and logic.
- SUSD: Structured Unsupervised Skill Discovery through State Factorization
-
This paper proposes SUSD (Structured Unsupervised Skill Discovery), which factorizes the state space into independent factors and assigns exclusive skill variables to each. Combined with a curiosity-driven factor weighting mechanism, it discovers diverse skills covering all controllable factors in complex multi-object/multi-agent environments.
- Tackling Heavy-Tailed Q-Value Bias in Offline-to-Online Reinforcement Learning with Laplace-Robust Modeling
-
This paper reveals for the first time that the Q-value bias during the online fine-tuning stage of Offline-to-Online Reinforcement Learning (O2O RL) follows a heavy-tailed distribution. It proposes LAROO, which uses adaptive Laplace noise to "absorb" the heavy-tailed nature of the bias into the noise, a robust loss \(D_b(x)\) to reduce estimation variance, and a conservative ensemble estimate to pull the bias mean back to zero. LAROO outperforms previous state-of-the-art O2O methods with an average improvement of +54.8% on D4RL.
- Task Tokens: A Flexible Approach to Adapting Behavior Foundation Models
-
To address the dilemma where "Goal-Conditioned Behavior Foundation Models" (GC-BFM, e.g., MaskedMimic) suffer from either cumbersome prompt engineering or prior-damaging full fine-tuning, this paper proposes Task Tokens: the entire BFM is frozen, and only a lightweight "task encoder" is trained via reinforcement learning. This encoder produces a learnable token inserted into the transformer sequence to adapt the BFM to new tasks. Each task requires only ~200K trainable parameters (125x fewer than baselines), converges 6x faster, and is more robust and human-like in OOD scenarios (varying gravity/friction) compared to full fine-tuning.
- TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning
-
TD-JEPA transforms JEPA-style latent prediction from an "auxiliary one-step prediction loss" into a "multi-policy, multi-step, TD-trained core objective." By simultaneously learning a state encoder, task encoder, successor-feature predictor, and latent policy on reward-free offline data, it enables zero-shot policy selection using only a few reward samples at test time.
- Temporal Representations for Exploration: Learning Complex Exploratory Behavior without Extrinsic Rewards
-
This paper proposes C-TeC, which utilizes temporal contrastive representations to estimate the similarity between current state-action pairs and future states. By converting the degree to which "future outcomes are difficult to predict in the representation space" into intrinsic rewards, it learns complex exploratory behaviors in maze coverage, robotic arm pick-and-place, and Craftax survival games without extrinsic rewards.
- The Art of Scaling Reinforcement Learning Compute for LLMs
-
This paper introduces a sigmoid-shaped "compute-performance" scaling law that decomposes LLM RL training into two fittable parameters: "performance ceiling \(A\)" and "computational efficiency \(B\)." Based on 400,000 GPU-hours of systematic ablation, the authors identify a robust recipe called SCALERL. By extrapolating curves from low-compute runs, they accurately predicted final validation performance in a single 100,000 GPU-hour training run, bringing pre-training-level predictability to RL.
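To make the two fitted parameters in the note above concrete, here is a minimal sketch of a sigmoid-shaped compute-performance curve. The functional form, the midpoint `c_mid`, the starting performance `r0`, and every numeric value are assumptions chosen for illustration; only the roles of the ceiling \(A\) (`a`) and the efficiency exponent \(B\) (`b`) come from the note.

```python
def sigmoid_scaling(compute, a, b, c_mid, r0=0.0):
    """Assumed saturating curve: performance rises from r0 toward the ceiling a.

    compute : training compute spent so far (any positive unit, e.g. GPU-hours)
    a       : asymptotic performance ceiling
    b       : exponent controlling how quickly the ceiling is approached
    c_mid   : compute at which half of the gap (a - r0) has been closed
    """
    if compute <= 0 or c_mid <= 0:
        raise ValueError("compute and c_mid must be positive")
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)
```

Under this form, raising `a` lifts the whole plateau, while raising `b` only changes how fast the plateau is reached. That separation is what would let a curve fitted on low-compute runs be extrapolated to a larger budget.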
- The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
-
This paper identifies that the commonly used reverse-KL (mode-seeking) regularization in RLVR is the primary cause of Pass@k diversity collapse and catastrophic forgetting. It proposes using mass-covering f-divergences (forward-KL / JS) as a "review mechanism," combined with dataset partitioning and generator-based implementations, to simultaneously improve Pass@1 and Pass@k while preserving cross-domain capabilities in mathematics and SQL tasks.
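The mode-seeking versus mass-covering distinction in the note above can be illustrated on a toy discrete distribution. This is a sketch, not the paper's implementation: the three-outcome distributions are made up, and "reverse" is taken to mean \(\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})\) while "forward" means \(\mathrm{KL}(\pi_{\text{ref}} \,\|\, \pi)\).

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def reverse_kl(policy, reference):
    """Mode-seeking direction: expectation is taken under the policy."""
    return kl(policy, reference)


def forward_kl(policy, reference):
    """Mass-covering direction: expectation is taken under the reference."""
    return kl(reference, policy)
```

For a reference with two equally likely modes, such as `[0.495, 0.495, 0.01]`, and a policy that has collapsed onto one of them, such as `[0.98, 0.01, 0.01]`, the forward direction gives the larger penalty because the policy assigns almost no mass to a mode the reference still covers.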
- The Markovian Thinker: Architecture-Agnostic Linear Scaling of Reasoning
-
This paper proposes Delethink: a method that enables reasoning LLMs to break an ultra-long Chain-of-Thought (CoT) into several fixed-length "chunks." Each chunk carries only a small number of tokens from the end of the previous chunk as a "Markovian state" while deleting the rest of the history. This reduces reasoning computation from quadratic to linear and maintains constant memory usage without modifying any model architecture, while achieving performance comparable to or exceeding standard Long-Chain-of-Thought (LongCoT).
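The quadratic-to-linear claim in the note above can be checked with a rough cost count. The sketch below uses the number of (query, key) pairs as a proxy for attention compute. The proxy, the function names, and the choice to ignore the cost of re-encoding the carried-over tokens are all simplifying assumptions, not details from the paper.

```python
def full_context_cost(total_tokens):
    """Attention pairs when each generated token attends to all earlier ones."""
    return total_tokens * (total_tokens + 1) // 2


def chunked_cost(total_tokens, chunk_size, carry_over):
    """Attention pairs when the context is reset at every chunk boundary.

    Each new chunk starts with only `carry_over` tokens kept from the end of
    the previous chunk, so the context never exceeds `chunk_size` tokens.
    """
    if not 0 <= carry_over < chunk_size:
        raise ValueError("need 0 <= carry_over < chunk_size")
    cost, produced, first = 0, 0, True
    while produced < total_tokens:
        start = 0 if first else carry_over            # tokens already in context
        new = min(chunk_size - start, total_tokens - produced)
        end = start + new
        # The token at context position i attends to i tokens (itself included).
        cost += end * (end + 1) // 2 - start * (start + 1) // 2
        produced += new
        first = False
    return cost
```

With a fixed `chunk_size`, doubling `total_tokens` roughly doubles `chunked_cost` but roughly quadruples `full_context_cost`. The peak context length, and therefore the memory needed for the cache, stays bounded by `chunk_size`.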
- The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning
-
From a theoretical optimization perspective, this paper decomposes "plasticity loss" in deep reinforcement learning into two mechanisms: rank collapse of the NTK Gram matrix and \(\Theta(1/k)\) decay of gradient magnitude. For the latter, it proposes a lightweight Sample Weight Decay (SWD), which linearly decreases the replay sampling probability with sample "age" to compensate for gradient decay and maintain learning capacity, consistently improving TD3, Double DQN, and SAC performance on MuJoCo, ALE, and DMC.
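One way to picture age-dependent replay sampling as described in the note above is the following sketch. The linear weight `1 - slope * age`, the `floor` that keeps old samples reachable, and all names are hypothetical; the note only states that sampling probability decreases linearly with sample age.

```python
import random


def age_decayed_probs(ages, slope, floor=1e-3):
    """Sampling probabilities that fall linearly with age (assumed form).

    ages  : age of each stored transition, e.g. updates since insertion
    slope : how much weight is lost per unit of age
    floor : minimum weight, so very old samples are never fully excluded
    """
    weights = [max(floor, 1.0 - slope * age) for age in ages]
    total = sum(weights)
    return [w / total for w in weights]


def sample_indices(ages, batch_size, slope, rng=random):
    """Draw a minibatch of buffer indices, favouring younger samples."""
    probs = age_decayed_probs(ages, slope)
    return rng.choices(range(len(ages)), weights=probs, k=batch_size)
```

Because only the sampling distribution changes, a scheme like this can sit in front of an existing replay buffer without touching the loss of TD3, Double DQN, or SAC.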
- The Sample Complexity of Online Reinforcement Learning: A Multi-Model Perspective
-
This paper proposes a set of online reinforcement learning algorithms for nonlinear dynamical systems with continuous state-action spaces. By utilizing multi-model posterior sampling and certainty equivalence policies, the approach achieves online learning of unknown systems and provides non-asymptotic policy regret guarantees ranging from finite model sets to parameterized model families.
- The State of Reinforcement Finetuning for Transformer-based Agents
-
This paper systematically introduces Reinforcement Finetuning (RFT) to the few-shot meta-RL adaptation of Transformer-based Agents (TA). Through a large-scale empirical comparison across two orthogonal axes: "Finetuning Parameter Configurations × Finetuning Algorithms," it finds that no single algorithm is universally optimal. Based on this, the authors propose QP (Q-guided Policy Optimization), a lightweight enhancement that combines the stability of SFT with the policy improvement capabilities of RL, consistently outperforming strong SFT/RFT baselines across all settings.
- Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
-
This paper proposes Latent Thought Policy Optimization (LTPO), a test-time reasoning enhancement framework that does not require updating model parameters. By treating intermediate latent "thought" vectors as optimizable dynamic parameters, it utilizes online policy gradient methods and intrinsic confidence reward signals to enhance the reasoning capabilities of frozen LLMs.
- TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs
-
TIPS utilizes a "lagged copy of the policy itself" as a teacher to provide a dense reward for each "reasoning + retrieval" turn based on the incremental log-likelihood of the answer. This is formulated as Potential-Based Reward Shaping (PBRS) injected into PPO to solve the sparse reward and credit assignment challenges in multi-turn tool-use RL without training an additional reward model—achieving an average EM 11.8% higher and F1 13.6% higher than PPO on 7B models while significantly mitigating training collapse.
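The shaping term in the note above follows the standard potential-based form \(r'_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)\). The sketch below applies it turn by turn, assuming the potential \(\Phi\) is the teacher's log-likelihood of the reference answer after each turn. The function name and the example numbers are illustrative, not taken from the paper.

```python
def shaped_rewards(rewards, potentials, gamma=1.0):
    """Add potential-based shaping to per-turn rewards.

    rewards    : environment reward for each of T turns (often sparse)
    potentials : T + 1 potential values, one before the first turn and one
                 after every turn (assumed here: teacher log-likelihood of
                 the reference answer given the context so far)
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need exactly one more potential than rewards")
    return [
        r + gamma * potentials[t + 1] - potentials[t]
        for t, r in enumerate(rewards)
    ]
```

For example, with `rewards = [0, 0, 1]` and `potentials = [-4.0, -2.5, -1.0, -0.5]`, every turn receives a non-zero signal. With `gamma = 1` the shaping terms telescope, so the total return changes only by the difference between the last and first potential.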
- Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
-
This paper proposes Token Hidden Reward (THR)—a token-level metric that quantifies each token's contribution to the "change in likelihood of correct responses." It finds that training dynamics are dominated by a small fraction of high |THR| tokens, and that the sign of THR corresponds exactly to exploration vs. exploitation. Based on this, a reweighting algorithm for GRPO advantages is designed using the sign of THR, where a single hyperparameter \(p\) can explicitly steer training toward exploitation (increasing greedy decoding accuracy) or exploration (increasing Pass@K).
- Toward Conservative Planning from Human-AI Preferences in Reinforcement Learning
-
This paper proposes MCP (Model-based Conservative Planning), a model-based offline preference reinforcement learning algorithm. By using the "performance difference relative to a reference policy" as the objective and "deviation regularization from the maximum likelihood model" to implicitly encode conservatism, it becomes the first to simultaneously achieve "provable sample efficiency" and "computational tractability" under conditions of partial data coverage and unknown transition dynamics. It performs comparably to or better than SOTA on Meta-World benchmarks with real human feedback.
- Toward Efficient Exploration by Large Language Model Agents
-
Instead of designing new LLM agent architectures to hope for "emergent" exploration capabilities, this paper advocates for the explicit implementation of Posterior Sampling Reinforcement Learning (PSRL)—a classical RL algorithm with proven exploration efficiency—using LLMs. By outsourcing the three core steps of PSRL to three distinct LLMs, the authors achieve cumulative regret curves that significantly outperform mainstream LLM agent baselines on bandits, tabular MDPs, and pure natural language tasks like Wordle and Combination Locks.
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward
-
DEPO integrates "offline data curation" and "online rollout pruning" into a unified RLVR workflow for the first time. Offline, it employs PageRank-weighted DPP and difficulty-aware normal sampling to select a diverse, influential, and moderately difficult subset. Online, it utilizes a sample-level explorability metric to skip rollouts of low-potential samples and replay underexplored ones. Consequently, it achieves performance comparable to full GRPO on AIME24/25 using only 20% of the data and 40% of the rollouts, accelerating training by approximately 1.6–1.85 times.
- Towards Strategic Persuasion with Language Models
-
Based on the Bayesian Persuasion framework, this study proposes a systematic methodology to evaluate and train the strategic persuasion capabilities of LLMs. It reveals that frontier models already possess significant strategic persuasion skills, and even small LLMs can substantially enhance their persuasive effectiveness through reinforcement learning.
- TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
-
TRACED improves regret approximation in Unsupervised Environment Design (UED) by incorporating transition prediction error (ATPL) alongside traditional PVL to capture dynamics model mismatch, and introduces Co-Learnability to measure transfer benefits between tasks. On MiniGrid and BipedalWalker, its performance after 10k updates exceeds that of every baseline after 20k updates.
- Trajectory Generation with Conservative Value Guidance for Offline Reinforcement Learning
-
A Transformer policy trained with Conservative Q-Learning (CQL) interacts with a pre-trained dynamics model to autoregressively "sample" synthetic trajectories, which are then merged into the original dataset to train standard offline RL algorithms. Conservative value penalties ensure generated samples do not deviate from the data distribution, resulting in higher quality than diffusion-based data augmentation (GTA) while significantly cutting training and generation time.
- Trinity: An Evolved LLM Coordinator
-
Trinity designs a lightweight coordinator (0.6B SLM + ~10K trainable parameters in the head) optimized via sep-CMA-ES. In multi-turn dialogues, it assigns queries to different LLMs and designates them as Thinker, Worker, or Verifier. It achieves a SOTA 86.2% pass@1 on LiveCodeBench and consistently outperforms all single-model and multi-agent baselines across four in-distribution and four out-of-distribution tasks.
- Triple-BERT: Do We Really Need MARL for Ride-hailing Dispatching?
-
This paper addresses real-time ride-hailing dispatching, a task that is "essentially centralized but has long been forced into a multi-agent formulation." It replaces mainstream MARL with a centralized single-agent reinforcement learning (SARL) framework, Triple-BERT (a variant of TD3 + action decomposition + BERT network + two-stage training). It achieves an overall improvement of approximately 11.95% over state-of-the-art methods on real Manhattan taxi data, with +4.26% served orders and -22.25% pickup time.
- TROLL: Trust Regions improve Reinforcement Learning for Large Language Models
-
This paper proposes TROLL (Trust Region Optimization for Large Language models), which replaces the clipping mechanism in PPO with a differentiable discrete trust region projection. It implements token-level policy updates based on principled KL constraints, consistently outperforming PPO-clip in mathematical reasoning and code generation tasks.
- TRAPO: Trust-Region Adaptive Policy Optimization
-
TRAPO decouples the traditional "SFT followed by RL" two-stage serial pipeline into an interleaved process within each individual sample. Expert trajectory prefixes are learned via SFT, while the model-generated continuations are learned via RL. By utilizing a Trust-Region version of SFT (TrSFT) to shift forward KL toward reverse KL for stable training, and employing adaptive prefix lengths to provide guidance based on problem difficulty, TRAPO achieves an average score of 56.6 across five mathematical reasoning benchmarks, outperforming SFT, pure RL, and SFT-then-RL.
- UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings
-
The authors propose UME-R1, the first exploration of a reasoning-driven generative multimodal embedding paradigm. Through two-stage training (cold-start SFT + Reinforcement Learning), the model is trained to reason before generating representations. It significantly outperforms traditional discriminative embedding models across 78 tasks in the MMEB-V2 benchmark.
- Understanding and Improving Hyperbolic Deep Reinforcement Learning
-
Through closed-form gradient analysis, this work reveals the root causes of PPO trust region failure in hyperbolic deep RL: conformal factor explosion and large-norm embeddings. A four-component solution, Hyper++ (RMSNorm + learnable scaling + HL-Gauss + Hyperboloid), is proposed, comprehensively outperforming previous baselines across 16 ProcGen environments and Atari-5.
- Universal Value-Function Uncertainties
-
This paper proposes UVU (Universal Value-Function Uncertainties), which measures the epistemic uncertainty of a value function using the prediction error between an online network and a fixed random target network. The key innovation is that the online network does not directly regress the target output (which would yield "myopic" RND-style uncertainty); instead, it performs TD learning using synthetic rewards generated by the target network. This allows the prediction error to automatically accumulate "uncertainty over future trajectories." Theoretically, in the infinite-width limit, this error is strictly equal to the variance of a universal Q-function ensemble. Experimentally, it achieves the performance of large ensembles with a single model in offline multi-task rejection scenarios while significantly reducing computational cost.
- Deconstructing Memory in Reinforcement Learning Agents: A Taxonomy and Evaluation Methodology
-
This paper does not propose a new model but provides a formal definition and evaluation methodology for the overused term "memory" in reinforcement learning. By using the relational horizon \(\xi\) and the agent's context length \(K\) to strictly distinguish between short-term memory (STM) and long-term memory (LTM), it introduces an actionable experimental configuration algorithm. It empirically demonstrates that failing to follow this methodology leads to severely distorted evaluation conclusions.
- Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals
-
The ULEE method is proposed to meta-learn pre-trained policies with high exploration efficiency and fast adaptation through adversarial goal generation and curriculum learning based on post-adaptation difficulty in unsupervised environments.
- Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning
-
The authors construct HitEmotion, a hierarchical multimodal emotion understanding benchmark based on Theory of Mind (ToM), and propose the TMPO framework to enhance MLLM emotion reasoning capabilities by using intermediate mental states as process-level supervision.
- Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning
-
This paper proposes MINTO, which modifies the TD bootstrapping target from "using only the target network" to "taking the minimum of the estimates from the target and online networks." By leveraging fresh online estimates to accelerate learning while suppressing overestimation bias via the min operator, MINTO can be integrated into algorithms like DQN, IQN, CQL, and SAC with nearly zero cost, leading to universal performance improvements.
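The change to the bootstrapping target described in the note above fits in a few lines for a DQN-style critic. This is a sketch under assumptions: applying the max over actions to each network first and then taking the minimum is one plausible reading, and the function name is hypothetical.

```python
def minto_style_target(reward, done, gamma, q_target_next, q_online_next):
    """TD target that bootstraps from the smaller of two next-state estimates.

    q_target_next : per-action Q-values at the next state from the target network
    q_online_next : per-action Q-values at the next state from the online network
    """
    boot_target = max(q_target_next)   # greedy value under the (stale) target net
    boot_online = max(q_online_next)   # greedy value under the (fresh) online net
    bootstrap = min(boot_target, boot_online)
    return reward + (0.0 if done else gamma * bootstrap)
```

The only extra work compared with a standard target is reading the online network's estimate at the next state, which is consistent with the note's claim of nearly zero added cost.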
- Using Reinforcement Learning to Train Large Language Models to Explain Human Decisions
-
The authors post-train an LLM using outcome-based reinforcement learning (GRPO) to predict human risky decision proportions while explicitly generating reasoning as a Chain-of-Thought (CoT). These CoT sequences serve as "interpretable cognitive theories" of human decision-making, achieving predictive accuracy comparable to Supervised Fine-Tuning (SFT) while providing natural language explanations unattainable through SFT.
- Value Flows
-
Value Flows introduces flow matching to distributional RL for the first time by learning a vector field where the generated probability density paths automatically satisfy the distributional Bellman equation. By efficiently estimating return variance via a flow derivative ODE, it enables confidence-weighted prioritized learning, achieving a 1.3× average success rate improvement on 62 OGBench tasks and over 3× better return distribution estimation accuracy than C51/CODAC.
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
-
Addressing the reference-based reward systems widely used in Large Reasoning Model (LRM) training, this work constructs two benchmarks, VerifyBench and VerifyBench-Hard. Through rigorous human annotation to evaluate the accuracy of various verification systems, it finds that even the strongest models achieve only approximately 88% accuracy on difficult samples, revealing significant room for improvement in current verification systems.
- VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning
-
Focusing on the open-ended task of role-playing, which lacks standard answers and verifiable rewards, this paper introduces a Hint mechanism to extract deterministic cues from role profiles, dialogue history, and playing requirements. These cues serve as anchors for a designed Verifiable Role-Awareness Reward (VRAR) used in GRPO training. This approach improves Qwen2.5-32B's average score on RAIDEN by 18.9% and CharacterEval by 4.55%, while preserving the creativity and stylistic diversity of role-playing.
- Vintix II: Decision Pre-Trained Transformer is a Scalable In-Context Reinforcement Learner
-
This paper scales the Decision Pre-Trained Transformer (DPT) from simplified discrete environments to cross-domain continuous control scenarios involving 10 domains and 209 tasks. By replacing the Gaussian head with a rectified-flow policy head to model multimodal action distributions (while preserving the interpretation of DPT as "Bayesian posterior sampling"), the authors train a 928M-parameter universal Large Action Model capable of simultaneous online/offline operation. It significantly outperforms the previous Vintix and REGENT on 46 unseen tasks.
- Virne: A Comprehensive Benchmark for RL-based Network Resource Allocation in NFV
-
The authors propose Virne, a comprehensive benchmark framework for Network Function Virtualization Resource Allocation (NFV-RA), which integrates 30+ algorithms and gym-style environments to support systematic evaluation across multiple scenarios including cloud, edge, and 5G.
- Wavelet Predictive Representations for Non-Stationary Reinforcement Learning
-
WISDOM treats the sequence of "evolving tasks" in non-stationary RL as a non-stationary signal. It uses a learnable wavelet representation network to transform task representation sequences into the wavelet domain, combined with a wavelet TD update operator and an autoregressive loss to capture multi-scale evolutionary trends. This enables the policy to adapt rapidly in environments with random periods and sharp transitions, significantly outperforming baselines in sample efficiency and final performance.
- Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
-
This paper proposes the Webscale-RL automated data pipeline, which systematically converts trillion-token pretraining corpora into millions of "verifiable QA pairs" for RL training. Using the resulting 1.2-million-sample RL dataset, which covers 9+ domains, GRPO training significantly outperforms continued pretraining and various data refinement baselines across multiple benchmarks, and matches the performance of continued pretraining with up to 100× fewer tokens.
- What Matters for Batch Online Reinforcement Learning in Robotics?
-
This paper provides a systematic empirical study. The authors decompose the paradigm where robots iteratively self-improve using large batches of self-collected data (batch online RL) into three axes: algorithm category, policy extraction method, and policy expressivity. They derive a robust "recipe": Value-based guidance (IQL) + Implicit Policy Extraction (sampling multiple actions and selecting the one with the highest Q-value) + Expressive Diffusion Policy, supplemented with temporal Ornstein–Uhlenbeck (OU) noise. This approach achieves up to 2× performance gains over imitation learning methods in six simulated manipulation tasks and improves the success rate of a real-world "tape-hanging" task by 30% over three iterations.
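The "implicit policy extraction" step named in the recipe above (sample several actions, keep the one with the highest Q-value) can be sketched as follows. `sample_action` and `q_value` are hypothetical stand-ins for a trained diffusion policy and an IQL critic, and `num_samples` is an arbitrary default.

```python
def select_action(state, sample_action, q_value, num_samples=8):
    """Best-of-N action selection guided by a learned critic.

    sample_action : callable(state) -> action, e.g. one draw from a generative policy
    q_value       : callable(state, action) -> float, the critic's estimate
    """
    candidates = [sample_action(state) for _ in range(num_samples)]
    return max(candidates, key=lambda action: q_value(state, action))
```

In this arrangement the policy is never updated through the critic's gradient. The critic only ranks proposals the policy already produces, which keeps the selected actions close to the data the policy was trained on.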
- When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
-
The authors train LLMs as meta-bandit agents for Multi-Armed Bandit (MAB) tasks. Systematic comparisons between SFT and RL with three reward types reveal that while both can reduce cumulative regret to levels comparable to UCB/Thompson Sampling and generalize to \(6\times\) longer horizons, behavioral analysis shows these "improvements" largely stem from a more sophisticated yet greedier exploitation strategy. The agents are more prone to premature abandonment of exploration (increased suffix failure) compared to pre-trained models, even outperforming the UCB teacher they mimic by being "lazily greedy."
- When Is Diversity Rewarded in Cooperative Multi-Agent Learning?
-
This paper attributes the long-standing question of "when does a multi-agent team need division of labor" to a curvature criterion of the reward function. By decomposing the team reward into a two-step process—an "inner operator" aggregating agent efforts on individual tasks and an "outer operator" aggregating task scores—it proves that whenever the inner operator is Schur-convex (or the outer is Schur-concave), a heterogeneous team strictly outperforms the optimal homogeneous team. Furthermore, a gradient search algorithm based on differentiable simulators, HetGPS, is used to automatically discover "heterogeneity-demanding" reward structures in embodied MARL environments, yielding results perfectly consistent with theoretical predictions.
- Who Matters Matters: Agent-Specific Conservative Offline MARL
-
Addressing the "one-size-fits-all" conservatism applied to all agents in offline MARL, this paper proposes OMCDA: it decouples the Q-function into "reward" and "policy divergence" components, then dynamically allocates conservatism to each agent based on its influence on system returns. This allows high-influence agents to deviate more from the behavior policy while keeping low-influence agents cautious, consistently outperforming existing offline MARL methods on MuJoCo and SMAC.
- WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control
-
WIMLE extends Implicit Maximum Likelihood Estimation (IMLE) to model-based RL by learning stochastic world models that capture multimodal transition dynamics. It estimates prediction uncertainty through ensemble and latent sampling and utilizes an uncertainty-weighted RL objective for synthetic data. WIMLE achieves sample efficiency and asymptotic performance exceeding strong model-free and model-based baselines across 40 continuous control tasks.
- XQC: Well-Conditioned Optimization Accelerates Deep Reinforcement Learning
-
XQC does not rely on scaling up models or complex architectures. Instead, starting from the "condition number" of the critic loss landscape, it proves that the combination of BatchNorm + Weight Normalization + Categorical Cross-Entropy loss can reduce the Hessian condition number by several orders of magnitude and naturally bound gradient norms. This allows it to achieve SOTA sample efficiency on 70 continuous control tasks using ~4.5× fewer parameters.
- Zero-Shot Adaptation of Behavioral Foundation Models to Unseen Dynamics
-
This paper identifies that Behavioral Foundation Models (BFMs) based on Forward-Backward (FB) representations tend to average future occupancy distributions across different environments when trained on offline data with mixed dynamics, making them unable to adapt to unseen dynamical changes. The authors propose estimating a hidden dynamics belief using a transformer and conditioning the FB forward representation and task vector sampling on this belief. The proposed method significantly outperforms vanilla FB, LAP, HILP, and other zero-shot RL baselines in environments such as FourRooms, PointMass, AntWind, and OGBench Scene.