🎮 Reinforcement Learning

🔬 ICLR2026 · 142 paper notes

A Unifying View of Coverage in Linear Off-Policy Evaluation: This paper proposes a novel coverage parameter—feature-dynamics coverage—and conducts a new finite-sample analysis of the classical LSTDQ algorithm through an instrumental variable lens, unifying the various fragmented coverage definitions in linear off-policy evaluation.
AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking: This paper proposes AbstRaL, which uses reinforcement learning to teach LLMs to construct mathematical abstractions of reasoning problems (replacing concrete numbers/names with symbolic variables and extracting general formulas), then employs a symbolic solver to derive answers. AbstRaL nearly eliminates performance degradation caused by distribution shift on GSM perturbation benchmarks, and also yields implicit improvements on OOD mathematical and general reasoning tasks.
AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification: This paper proposes AMPED, a framework that applies gradient surgery (PCGrad) during skill pretraining to balance gradient conflicts between exploration (entropy + RND) and skill diversity (AnInfoNCE), and employs a SAC-based skill selector during fine-tuning to adaptively choose the optimal skill. AMPED outperforms SBRL baselines including DIAYN, CeSD, and CIC on Maze and URLB benchmarks.
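The gradient-surgery step named in the AMPED entry can be illustrated with a minimal sketch of the PCGrad projection rule for two objectives. It assumes plain Python lists as gradient vectors; the function and variable names are hypothetical and are not taken from the AMPED code, and the example gradients are made up for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(g: Vector, other: Vector) -> Vector:
    """Return g with its component along `other` removed when the two
    gradients conflict (negative inner product); otherwise return g unchanged."""
    inner = dot(g, other)
    if inner >= 0.0:
        return list(g)
    scale = inner / dot(other, other)  # other is non-zero whenever inner < 0
    return [x - scale * y for x, y in zip(g, other)]


def pcgrad_two_tasks(g_explore: Vector, g_diversity: Vector) -> Vector:
    """Combine two task gradients after projecting each against the other's
    original (unprojected) gradient, as in PCGrad for two tasks."""
    p_explore = project_if_conflicting(g_explore, g_diversity)
    p_diversity = project_if_conflicting(g_diversity, g_explore)
    return [a + b for a, b in zip(p_explore, p_diversity)]


if __name__ == "__main__":
    # Hypothetical conflicting gradients (inner product is -1).
    g_explore = [1.0, 0.0]
    g_diversity = [-1.0, 1.0]
    print(pcgrad_two_tasks(g_explore, g_diversity))
```

When the two gradients do not conflict, the function reduces to an ordinary sum, so the projection only changes the update where the objectives pull against each other.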
APPLE: Toward General Active Perception via Reinforcement Learning: This paper proposes APPLE, a general active perception framework that combines reinforcement learning with supervised learning. Active perception is formulated as a POMDP, with the reward defined as the RL reward minus the prediction loss. The gradient naturally decomposes into a policy gradient term and a prediction loss gradient term. Built upon off-policy algorithms (SAC/CrossQ) and a shared ViViT backbone, the framework is validated across 5 diverse task benchmarks. The CrossQ variant requires no per-task hyperparameter tuning and achieves a 53% improvement in training efficiency.
ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning: This paper proposes ARM-FM, a framework that leverages foundation models (e.g., GPT-4o) to automatically generate Language-Aligned Reward Machines (LARMs) from natural language task descriptions — encompassing the automaton structure, executable label functions, and per-state natural language descriptions — providing RL agents with compositional dense reward signals. The framework successfully solves sparse-reward long-horizon tasks that standard RL completely fails to learn, across environments including MiniGrid, Craftium (3D Minecraft), and Meta-World, while achieving zero-shot task generalization.
AutoQD: Automatic Discovery of Diverse Behaviors with Quality-Diversity Optimization: This paper proposes AutoQD, which embeds policy occupancy measures into a finite-dimensional space via random Fourier features (RFF) and then applies weighted PCA for dimensionality reduction to obtain behavior descriptors (BDs), enabling QD optimization without manually designed BDs. Across 6 standard continuous control tasks, it discovers diverse, high-quality policies and outperforms both hand-crafted BDs and existing unsupervised QD methods.
AutoTool: Automatic Scaling of Tool-Use Capabilities in RL via Decoupled Entropy Constraints: This paper proposes AutoTool, a reinforcement learning framework with Decoupled Adaptive Entropy Constraints that addresses two problems in LLM tool use: reasoning collapse under direct RL training and overthinking in scaled models. AutoTool lets LLMs switch automatically between long and short reasoning modes based on problem difficulty in tool-calling tasks, achieving a 9.8% accuracy improvement while reducing inference token overhead by approximately 81%.
AWM: Accurate Weight-Matrix Fingerprint for Large Language Models: AWM is a training-free LLM weight-matrix fingerprinting method that recovers permutation and sign-flip transformations in the embedding layer via the Linear Assignment Problem (LAP), and then applies unbiased CKA to neutralize orthogonal transformations in Q/K matrices. It achieves perfect AUC (1.0) on 150 LLM pairs, is robust to six categories of post-training (SFT, continued pretraining up to 5.5T tokens, RL, multimodal extension, pruning, and upcycling), and completes within 30 seconds.
BA-MCTS: Bayes Adaptive Monte Carlo Tree Search for Offline Model-based RL: This work is the first to introduce Bayes Adaptive MDPs (BAMDPs) into offline model-based RL. It proposes Continuous BAMCP to handle Bayesian planning in continuous state/action spaces, combines pessimistic reward penalization with search-based policy iteration (an "RL + Search" paradigm), achieves significant improvements over 19 baselines on 12 D4RL tasks (Cohen's \(d > 1.8\)), and demonstrates successful application to tokamak fusion control.
Boolean Satisfiability via Imitation Learning: This paper proposes ImitSAT, the first imitation learning-based branching heuristic for CDCL solvers. By compressing solver runs into conflict-free KeyTrace expert sequences and framing branching decisions as an autoregressive prediction task conditioned on the decision prefix, ImitSAT significantly reduces propagation counts and solving time under a small query budget, and demonstrates strong generalization to structured SAT benchmarks.
Breaking Barriers: Do Reinforcement Post Training Gains Transfer To Unseen Domains?: Through an observational study (18 open-source RPT models) and an interventional study (single-domain GRPO training), this paper systematically reveals the generalization limitations of Reinforcement Post-Training (RPT/RLVR): RPT yields substantial within-domain gains, but cross-domain generalization is inconsistent — structured domains (math ↔ code) exhibit mutual transfer, whereas gains do not generalize to unstructured domains (law/finance/medicine). This finding holds consistently across algorithms, model scales, and training steps.
Chain-of-Context Learning: Dynamic Constraint Understanding for Multi-Task VRPs: This paper proposes Chain-of-Context Learning (CCL), which achieves stepwise dynamic constraint-aware decoding via Relevance-Guided Context Reformulation (RGCR, adaptively aggregating constraint information to construct context) and Trajectory-Shared Node Re-embedding (TSNR, sharing node updates across trajectories to avoid redundant computation). CCL comprehensively outperforms existing methods across 48 VRP variants (16 in-distribution + 32 out-of-distribution).
Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models: Co-rewarding proposes a self-supervised RL framework that addresses training collapse in self-rewarding RL through two complementary supervision perspectives: a data-side mechanism (cross-view consistency via contrastive paraphrased questions) and a model-side mechanism (EMA teacher model providing pseudo-labels). Without any human annotations, the framework matches or surpasses RLVR (with ground-truth labels) across multiple mathematical reasoning benchmarks.
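The EMA teacher mentioned in the Co-rewarding entry relies on an exponential moving average of the student's parameters. The sketch below shows only that generic update rule, assuming parameters stored as a name-to-float dictionary; the `decay` value and all names are hypothetical, and how the teacher's pseudo-labels are produced is not covered here.

```python
from typing import Dict

Params = Dict[str, float]


def ema_update(teacher: Params, student: Params, decay: float = 0.99) -> Params:
    """Return new teacher parameters as an exponential moving average:
    teacher <- decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


if __name__ == "__main__":
    # Hypothetical scalar "parameters" for illustration only.
    teacher = {"w": 1.0, "b": 0.0}
    student = {"w": 0.0, "b": 1.0}
    for step in range(3):
        teacher = ema_update(teacher, student, decay=0.9)
        print(step, teacher)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason to prefer an averaged teacher over the raw student as a source of targets.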
Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning: This paper proposes VIP (Value Iteration via PINN), the first framework to apply Physics-Informed Neural Networks (PINNs) for solving HJB PDEs in continuous-time multi-agent reinforcement learning. A Value Gradient Iteration (VGI) module is introduced to iteratively refine value gradients. VIP consistently outperforms both discrete-time and continuous-time baselines on continuous-time MPE and MuJoCo multi-agent tasks.
Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning: CalibRL reframes expert data as a distribution calibration baseline (rather than a strict imitation target), and achieves fine-grained control over the exploration–exploitation trade-off in MLLM reasoning training via asymmetric LeakyReLU activation combined with advantage weighting. This addresses entropy collapse in RLVR and substantially outperforms GRPO/DAPO on tasks such as geometric reasoning.
Cross-Embodiment Offline Reinforcement Learning for Heterogeneous Robot Datasets: This paper systematically investigates cross-embodiment offline RL pretraining, identifies gradient conflicts leading to negative transfer under increasing suboptimal data ratios and robot diversity, and proposes Embodiment Grouping (EG)—a strategy that clusters robots by morphological graph distance and updates the actor group-wise. On a locomotion benchmark spanning 16 robot platforms, EG substantially mitigates negative transfer (IQL+EG improves over IQL by 34% on the 70% suboptimal dataset).
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning: This paper presents CUDA-L1, a three-stage pipeline framework based on Contrastive Reinforcement Learning (Contrastive RL), which trains an LLM with initially weak CUDA capabilities into an effective CUDA optimizer. The framework achieves an average 3.12× speedup across 250 CUDA kernels on KernelBench, with a peak speedup of 120×, and generalizes across GPU architectures.
Deep SPI: Safe Policy Improvement via World Models: This paper establishes a theoretical framework for Safe Policy Improvement (SPI) that unifies world models and representation learning with policy update guarantees: an importance-ratio-based neighborhood operator constrains policy updates to ensure monotonic improvement and convergence; local transition/reward losses control world model quality and representation stability. The proposed DeepSPI algorithm matches or surpasses PPO and DeepMDP on the ALE-57 benchmark.
Distributionally Robust Cooperative Multi-Agent Reinforcement Learning via Robust Value Factorization: This paper proposes the Distributionally Robust IGM (DrIGM) principle, integrating distributionally robust optimization into the value factorization framework of cooperative multi-agent RL, enabling classical methods such as VDN, QMIX, and QTRAN to maintain robust decentralized execution performance under distribution shift between training and deployment environments.
DiVE-k: Differential Visual Reasoning for Fine-grained Image Recognition: This paper proposes DiVE-k, a framework that constructs multiple-choice questions (MCQs) from the top-k outputs of a large vision-language model (LVLM) and trains the model via GRPO reinforcement learning to perform differential visual reasoning, achieving substantial improvements in base-to-novel generalization for fine-grained image recognition.
Divide, Harmonize, Then Conquer It: Shooting Multi-Commodity Flow Problems with Multimodal Language Models: This paper proposes Pram, the first framework to leverage multimodal language models (MLMs) for solving multi-commodity flow (MCF) problems. It decomposes the original problem into subproblems via partitioning and employs multi-agent reinforcement learning (MARL) to coordinate global consistency across subproblems. Theoretical convergence to the optimal solution is proven, and empirical results show that Pram is 1–2 orders of magnitude faster than LP solvers while achieving near-optimal performance.
Don't Just Fine-tune the Agent, Tune the Environment: This paper proposes the Environment Tuning training paradigm, which enables LLM agents to learn complex multi-turn tool use from scratch using only 400 training samples, through structured curriculum learning, actionable environment-augmented feedback, and fine-grained progress rewards, while achieving strong out-of-distribution generalization.
Dual-Robust Cross-Domain Offline Reinforcement Learning Against Dynamics Shifts: This work is the first to simultaneously address train-time robustness (source–target domain dynamics mismatch) and test-time robustness (deployment-environment dynamics shift) in cross-domain offline RL. The proposed DROCO algorithm centers on the Robust Cross-Domain Bellman (RCB) operator, which applies a robust Bellman update to source-domain data and a standard in-sample update to target-domain data. It also reformulates intractable dynamics uncertainty as state-space perturbations via dual reconstruction. On the D4RL benchmark, DROCO achieves a total score of 1105.2, surpassing the second-best method by 14%, and its performance degradation under hard-level dynamics perturbations is only half that of the baselines.
Dual Goal Representations: This paper proposes dual goal representations, which encode a goal state via the set of optimal temporal distances from all states to that goal. The authors theoretically prove that this representation is sufficient for recovering the optimal policy and naturally filters exogenous noise. A practical learning algorithm based on asymmetric inner product parameterization is designed, and the resulting module consistently improves three mainstream offline GCRL methods across 20 OGBench tasks as a plug-and-play component.
DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning: This paper proposes DVLA-RL, a framework that employs Dual-level Semantic Construction (DSC) to generate complementary low-level attributes and high-level descriptions, and uses RL-gated Attention (RLA) to dynamically balance the contributions of self-attention and cross-attention across different network layers. This achieves hierarchical vision-language alignment from low to high levels, attaining state-of-the-art performance on 9 few-shot learning benchmarks.
Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning: This paper proposes a novel paradigm called audio-interleaved reasoning, which treats audio as an active component during inference rather than a static context, enabling LALMs to dynamically locate and re-listen to audio segments during the reasoning process. Through a two-stage SFT+RL training framework and a structured data generation pipeline, the authors build the Echo model, which surpasses GPT-4o and Gemini-2.0-Flash on both expert-level and general audio understanding benchmarks.
Efficient Estimation of Kernel Surrogate Models for Task Attribution: This paper proposes a kernel surrogate model (KernelSM) for task attribution. By employing RBF kernel ridge regression to capture nonlinear interaction effects among tasks, combined with a gradient-projection-based efficient estimation algorithm that eliminates repeated retraining, KernelSM achieves a 25% improvement in correlation over linear surrogate and influence function baselines across mathematical reasoning, in-context learning, and multi-objective RL settings.
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph: This paper proposes Egg-SR, a unified framework that embeds symbolic equivalence via equality graphs (e-graphs) into three categories of symbolic regression methods—MCTS, DRL, and LLM—achieving subtree pruning, policy gradient variance reduction, and feedback prompt enrichment, respectively. Theoretical results prove that Egg-MCTS tightens the regret bound and Egg-DRL reduces gradient estimation variance, while experiments consistently validate improved expression discovery accuracy.
Emergence of Spatial Representation in an Actor-Critic Agent with Hippocampus-Inspired Sequence Generator: Inspired by the intrinsic recurrent circuitry of hippocampal region CA3, this paper proposes a minimal sequence generator (shift register) integrated with an actor-critic framework to achieve maze navigation under sparse visual input, while giving rise to neurobiologically observed phenomena including place fields, DG orthogonalization, distance-dependent spatial kernels, and task-dependent remapping.
Entropy-Preserving Reinforcement Learning (REPO / ADAPO): This paper identifies the theoretical root cause of systematic policy entropy collapse in policy gradient RL algorithms for LLM post-training — namely, the positive correlation between advantage functions and log-probabilities — and proposes two complementary solutions: REPO (decorrelating the advantage function) and ADAPO (adaptive asymmetric clipping), achieving state-of-the-art performance on interactive tool-use tasks.
ExGRPO: Learning to Reason from Experience: This paper presents the first systematic study of what types of reasoning experiences are most valuable for RLVR, finding that medium-difficulty problems paired with low-entropy trajectories are most effective. Based on these findings, it proposes the ExGRPO framework for experience management and mixed-policy optimization, achieving an average gain of +3.5 points on mathematical reasoning and +7.6 points on general reasoning.
Exploration vs Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward: Through theoretical derivation and cross-model experiments, this paper demonstrates that the learning signal provided by clipping bias in RLVR is negligible (≤1/17); the true effect of clipping is an implicit entropy compression on the policy. A reward mislabeling model is further proposed to explain why random rewards can benefit stronger models.
FAPO: Flawed-Aware Policy Optimization for Efficient and Reliable Reasoning: To address the problem of flawed-positive rollouts in RLVR training—where the model reaches a correct answer through unreliable reasoning—this paper proposes the FAPO algorithm. FAPO employs a GenRM to detect flawed reasoning and applies a parameter-free reward penalty mechanism that realizes a natural "exploit-then-suppress" learning trajectory, simultaneously improving outcome correctness, process reliability, and training stability.
Flow Actor-Critic for Offline Reinforcement Learning (FAC): FAC is the first method to use a continuous normalizing flow model for both an expressive actor policy and a critic penalty mechanism based on exact density estimation. By identifying OOD regions for selective conservative Q-value estimation, FAC achieves an average score of 60.3 across 55 OGBench tasks, substantially outperforming the previous best of 43.6.
From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning: This work discovers that the reasoning performance of multimodal LLMs is highly correlated with the Visual Attention Score (VAS) (\(r=0.96\)), and proposes the AVAR framework, which improves VAS through three stages—visual-anchored data synthesis, attention-guided training objectives, and visual-anchored reward shaping—achieving an average improvement of 7% across 77 benchmarks.
From Observations to Events: Event-Aware World Model for Reinforcement Learning: This paper proposes the Event-Aware World Model (EAWM), a general framework that automatically generates events from raw observations and learns event-aware representations without manual annotations, improving existing MBRL baselines by 10%–45% and achieving new state-of-the-art results on Atari 100K, Craftax 1M, DeepMind Control 500K, and DMC-GB2 500K.
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for RL of Open-ended Generation: This paper proposes RLVRR, a framework that extends RLVR (reinforcement learning with verifiable rewards) from mathematical/code reasoning to open-ended text generation. It extracts hierarchical keyword sequences (content rewards) and executable Python checking functions (style rewards) from high-quality reference answers, forming a "reward chain" to replace single-point verification signals. On 10+ benchmarks, RLVRR trained on 10K examples outperforms 100K SFT and advanced reward models.
GraphOmni: A Comprehensive and Extensible Benchmark Framework for Large Language Models on Graph-theoretic Tasks: This paper proposes GraphOmni, a benchmark framework that systematically evaluates the graph-theoretic reasoning capabilities of 11 LLMs across 241K queries spanning 7 graph types × 7 serialization formats × 9 prompting strategies, reveals complex interaction effects among these three dimensions, and introduces an RL-guided combinatorial search method that achieves approximately 90% of optimal accuracy at roughly 25% of the evaluation cost.
Helix: Evolutionary Reinforcement Learning for Open-Ended Scientific Problem Solving: This paper proposes HELIX, a framework that integrates reinforcement learning (GRPO) with evolutionary algorithms (NSGA-II) for open-ended scientific problem solving. RL iteratively optimizes the policy, evolutionary mechanisms balance solution quality and diversity, and in-context learning leverages historical solutions to guide exploration. Using only a 14B model, HELIX surpasses GPT-4o pipelines across 20 tasks spanning circle packing, machine learning optimization, and more.
How Far Can Unsupervised RLVR Scale LLM Training?: This paper presents a comprehensive analysis of Unsupervised Reinforcement Learning with Verifiable Rewards (URLVR), demonstrating that all intrinsic reward methods fundamentally operate as a "sharpening" mechanism over the model's initial distribution, leading to an inevitable rise-then-fall collapse pattern. It proposes the Model Collapse Step as a prior-based model indicator and identifies external reward methods as the key direction for overcoming scalability bottlenecks.
How LLMs Learn to Reason: A Complex Network Perspective: This paper proposes a "sparse concept network" theory from a complex network perspective to provide a unified explanation of four puzzling phenomena in RLVR training (V-shaped response length, two-stage learning curve, catastrophic forgetting, and policy collapse). It reveals that all four phenomena originate from the topological self-organization of sparse reasoning graphs with average degree approximately 2, and derives the Annealed-RLVR algorithm, which surpasses standard RLVR on mathematical reasoning benchmarks.
InFOM: Intention-Conditioned Flow Occupancy Models: InFOM learns a latent intention encoder via variational inference and models intention-conditioned discounted state occupancy measures using flow matching, enabling efficient pre-training and fine-tuning in RL. It achieves 1.8× median return and 36% higher success rate over baselines across 36 state-based tasks and 4 image-based tasks.
Is Pure Exploitation Sufficient in Exogenous MDPs with Linear Function Approximation?: This paper proves that pure exploitation (without exploration) suffices to achieve sublinear regret in Exogenous MDPs (Exo-MDPs), where uncertainty arises solely from exogenous inputs independent of agent actions. In the tabular setting, the PTO algorithm attains \(\tilde{O}(H^2|\Xi|\sqrt{K})\) regret; under linear function approximation, the LSVI-PE algorithm achieves regret that is polynomial in the feature dimension and the exogenous state space, yet independent of the endogenous state and action spaces.
LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection: This paper proposes the LadderSym architecture for music practice error detection. It addresses insufficient cross-stream alignment in late-fusion approaches via an interleaved cross-stream alignment module (Ladder), and reduces frequency ambiguity in audio-only score representations by incorporating symbolic score prompts (Sym). On MAESTRO-E, the missed-note F1 score improves from 26.8% to 56.3%.
Latent Wasserstein Adversarial Imitation Learning: LWAIL leverages ICVF to learn a dynamics-aware latent representation from a small amount of random data, replacing the Euclidean ground metric in Wasserstein-based imitation learning with a latent-space distance. The method achieves expert-level imitation performance using only a single state-only expert trajectory.
Learning from Synthetic Data Improves Multi-hop Reasoning: This paper finds that RLVR training on fully fictitious, rule-generated synthetic data significantly improves LLM performance on real-world multi-hop reasoning tasks (56%–131% gains for Qwen3-0.6B), because the model learns knowledge composition as a generalizable reasoning skill rather than memorizing factual knowledge.
Learning to Generate Unit Test via Adversarial Reinforcement Learning: This paper proposes UTRL, a framework that iteratively trains a unit test generator and a code generator via adversarial RL — the test generator learns to produce discriminative test cases that distinguish LLM-generated code from correct solutions, while the code generator learns to pass those tests. A Qwen3-4B model trained with UTRL surpasses GPT-4.1 in test generation quality.
Learning to Orchestrate Agents in Natural Language with the Conductor: A 7B Qwen2.5 model is trained via GRPO as a "Conductor" that outputs complete agent workflows in natural language—comprising subtask instructions, worker assignments, and communication topology access lists—to coordinate frontier models such as GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. Trained on only 960 questions × 200 iterations, the Conductor achieves an average accuracy of 77.27% across 7 reasoning benchmarks, surpassing all single-model baselines (GPT-5: 74.78%) and multi-agent baselines.
Learning to Play Multi-Follower Bayesian Stackelberg Games: This paper provides the first systematic study of online learning in multi-follower Bayesian Stackelberg Games (BSGs). By geometrically partitioning the leader's strategy space into best-response regions, it achieves a regret bound of \(\tilde{O}(\sqrt{\min\{L, nK\} \cdot T})\) under type feedback — a bound that does not grow polynomially in the number of followers \(n\) — and establishes a nearly matching lower bound of \(\Omega(\sqrt{\min\{L, nK\}T})\).
Less is More: Clustered Cross-Covariance Control for Offline RL: This paper identifies that the standard squared-error TD objective introduces harmful cross-covariance in offline RL, and proposes C⁴ (Clustered Cross-Covariance Control for TD), which mitigates this effect via partitioned buffer sampling and an explicit gradient-based corrective penalty, achieving up to 30% return improvement in small-dataset and OOD-dominated settings.
LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards: This paper proposes LongRLVR, which introduces verifiable context rewards into RLVR training to address the gradient vanishing problem of contextual grounding caused by relying solely on final-answer rewards in long-context settings, significantly improving LLM long-context reasoning capabilities.
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning: This paper proposes LongWriter-Zero: starting from a base model, without relying on any annotated or synthetic data, the approach uses GRPO reinforcement learning combined with a three-dimensional composite reward model (length / quality / format) to elicit emergent ultra-long, high-quality text generation. With 32B parameters, the model surpasses 100B+ models such as DeepSeek-R1 and Qwen3-235B on WritingBench.
LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts: This paper proposes LoongRL, which constructs KeyChain synthetic data for reinforcement learning training to elicit a plan–retrieve–reason–recheck reasoning pattern in LLMs for long-context tasks. Training solely on 16K contexts generalizes to 128K; the 14B model achieves 74.2, approaching o3-mini (74.5) and DeepSeek-R1 (74.9).
MARS-Sep: Multimodal-Aligned Reinforced Sound Separation: MARS-Sep reformulates query-conditioned sound separation as a reinforcement learning problem, performing stochastic decisions over time-frequency bins via a factorized Beta mask policy, and leverages a progressively aligned multimodal encoder to provide semantic reward signals, achieving simultaneous improvements in signal fidelity and semantic consistency.
Menlo: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages: This paper proposes the Menlo framework, which decomposes native-like response quality into four dimensions grounded in audience design theory, constructs a preference dataset of 6,423 annotated pairs covering 47 language varieties, and demonstrates that pairwise evaluation combined with RL-trained LLM judges can approach human annotator agreement levels.
MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding: MergeMix proposes a token merging–based mixup data augmentation method that generates mixed images in attention space via bipartite soft matching, uses the mixing ratio as a soft margin in preference optimization, and unifies SFT and RL training paradigms across image classification and multimodal large language model settings.
Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start: This paper proposes SPECS, a three-stage cold-start framework that (1) generates preference data via self-distillation (distinguishing only format differences), (2) applies DPO for format pre-alignment as the cold start, and (3) follows with GRPO fine-tuning. By decoupling format learning from reasoning learning, SPECS achieves consistent performance gains of +4.1% on MEGA-Bench and +12.2% on MathVista.
ROMI: Model-based Offline RL via Robust Value-Aware Model Learning with Implicitly Differentiable Adaptive Weighting: ROMI achieves robust value-aware model learning by converting the dynamics uncertainty set into a state uncertainty set via Wasserstein duality, and employs an implicitly differentiable adaptive weighting mechanism to balance dynamics accuracy against value-awareness. This resolves the Q-value underestimation and gradient explosion issues in RAMBO, achieving state-of-the-art performance among model-based offline RL methods on D4RL and NeoRL.
Model Predictive Adversarial Imitation Learning for Planning from Observation: This paper proposes MPAIL (Model Predictive Adversarial Imitation Learning), which embeds an MPPI planner natively into the adversarial imitation learning loop, achieving the first end-to-end Planning-from-Observation (PfO) framework. MPAIL comprehensively outperforms policy-based AIL methods in generalization, robustness, interpretability, and sample efficiency, and is successfully deployed on a real-world robot navigation task from a single observed demonstration.
MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation: MoMaGen formulates demonstration data generation for bimanual mobile manipulation as a constrained optimization problem. By combining hard constraints (reachability, collision-free motion, visibility) with soft constraints (object visibility during navigation, retraction to compact poses), the framework automatically generates large-scale, diverse datasets from a single human teleoperation demonstration. The resulting visuomotor policy can be deployed on a physical robot with only 40 real demonstrations for fine-tuning.
MVR: Multi-view Video Reward Shaping for Reinforcement Learning: This paper proposes the MVR framework, which learns a state relevance function from multi-view video via video-text similarity matching. Combined with state-dependent reward shaping that automatically attenuates VLM guidance, MVR outperforms existing VLM-based reward methods across 19 tasks on HumanoidBench and MetaWorld.
Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning: This paper proposes MB-AIL (Model-Based Adversarial Imitation Learning), establishing horizon-free second-order sample complexity upper bounds under general function approximation. Combined with information-theoretic lower bounds on constructed hard instances, MB-AIL is shown to be minimax optimal (up to logarithmic factors) in terms of online interaction sample complexity.
Nearly-Optimal Bandit Learning in Stackelberg Games with Side Information: By linearizing the leader's utility space in Stackelberg games, this paper proposes a reduction to linear contextual bandits that improves the regret bound from \(\tilde{O}(T^{2/3})\) to the nearly-optimal \(\tilde{O}(T^{1/2})\) under bandit feedback with side information.
Offline Reinforcement Learning with Generative Trajectory Policies: This paper proposes the Generative Trajectory Policy (GTP), which unifies diffusion models, flow matching, and consistency models by learning the complete solution mapping of an ODE. Combined with two key adaptation techniques—score approximation and value-guided weighting—GTP achieves state-of-the-art performance on D4RL.
On Discovering Algorithms for Adversarial Imitation Learning: This paper proposes DAIL — the first meta-learned adversarial imitation learning algorithm. It decomposes AIL into two stages (density ratio estimation and reward assignment), and employs LLM-guided evolutionary search to automatically discover an optimal reward assignment (RA) function \(r_{\text{disc}}\), achieving generalization to unseen environments and policy optimizers while surpassing all manually designed baselines.
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification: This paper mathematically proves, from an RL policy gradient perspective, that SFT gradients implicitly encode a pathological reward structure with inverse probability weighting (\(1/\pi_\theta\)), causing excessively large gradients on low-probability tokens and limiting generalization. The paper proposes DFT (Dynamic Fine-Tuning), which requires only a one-line code modification (multiplying the CE loss by the token probability: \(-p\log p\)) to eliminate inverse probability weighting. DFT substantially outperforms SFT on mathematical reasoning, code generation, and multimodal tasks, and even surpasses GRPO/PPO in the offline RL setting.
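As a rough illustration of the reweighting described for DFT, the sketch below compares per-token gradients with respect to the logits under plain cross-entropy and under cross-entropy scaled by the target-token probability. It is a minimal pure-Python sketch, not the authors' code: the function names are hypothetical, and treating the probability weight as a constant (a stop-gradient) is an assumption not stated in the summary above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss_and_grad(logits, target):
    """Cross-entropy loss -log p[target] and its gradient w.r.t. the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

def dft_loss_and_grad(logits, target):
    """CE scaled by the target probability; the weight is held constant (assumption)."""
    weight = softmax(logits)[target]
    loss, grad = sft_loss_and_grad(logits, target)
    return weight * loss, [weight * g for g in grad]

logits = [2.0, 0.0, -2.0]
target = 2  # a low-probability token under these logits
sft_loss, sft_grad = sft_loss_and_grad(logits, target)
dft_loss, dft_grad = dft_loss_and_grad(logits, target)
print(f"SFT grad on target logit: {sft_grad[target]:.4f}")
print(f"DFT grad on target logit: {dft_grad[target]:.4f}")
```

Under this assumption the gradient on a low-probability target is scaled down by the token probability \(p\), which cancels the \(1/\pi_\theta\) factor that the summary attributes to plain SFT.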
On the \(O(1/T)\) Convergence of Alternating Gradient Descent-Ascent in Bilinear Games: This paper provides the first proof that alternating gradient descent-ascent (AltGDA) converges to a Nash equilibrium at an \(O(1/T)\) rate in constrained bilinear zero-sum games (when an interior NE exists), outperforming simultaneous GDA's \(O(1/\sqrt{T})\) rate. The analysis characterizes the "friction" effect produced when trajectories collide with the boundary via an energy function decay argument, and further optimizes step sizes through performance estimation programming (PEP).
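To make the difference between the two update orders concrete, the toy sketch below runs simultaneous and alternating gradient descent-ascent on the unconstrained scalar bilinear objective \(f(x, y) = xy\). This is a deliberate simplification: the paper's result concerns constrained games, and the step size, starting point, and function names here are illustrative assumptions.

```python
def sim_gda_step(x, y, eta):
    """Simultaneous GDA on f(x, y) = x * y: both players use the old iterate."""
    return x - eta * y, y + eta * x

def alt_gda_step(x, y, eta):
    """Alternating GDA: the ascent player responds to the already-updated x."""
    x_new = x - eta * y
    y_new = y + eta * x_new
    return x_new, y_new

def run(step, x, y, eta, steps):
    """Apply the given update rule for a fixed number of steps."""
    for _ in range(steps):
        x, y = step(x, y, eta)
    return x, y

eta, steps = 0.1, 1000
x0, y0 = 1.0, 1.0
xs, ys = run(sim_gda_step, x0, y0, eta, steps)
xa, ya = run(alt_gda_step, x0, y0, eta, steps)

sim_norm_sq = xs * xs + ys * ys
alt_norm_sq = xa * xa + ya * ya
print(f"simultaneous GDA squared norm: {sim_norm_sq:.3e}")
print(f"alternating GDA squared norm:  {alt_norm_sq:.3f}")
```

In this toy case the simultaneous update multiplies \(x^2 + y^2\) by \(1 + \eta^2\) at every step, so the iterates spiral outward, while the alternating update conserves \(x^2 + y^2 - \eta xy\) and therefore stays bounded.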
One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning: This paper proposes ScaleZero, which incorporates a Mixture-of-Experts (MoE) architecture into a unified world model to address gradient conflict and plasticity collapse in multi-task learning. Combined with a Dynamic Parameter Scaling (DPS) strategy that adaptively allocates model capacity, a single multi-task model achieves performance comparable to single-task expert models across three benchmarks (Atari/DMC/Jericho) while reducing environment interactions by approximately 28.5%.
Online Minimization of Polarization and Disagreement via Low-Rank Matrix Bandits: This paper is the first to formalize the problem of minimizing polarization and disagreement under the Friedkin-Johnsen opinion dynamics model as an online low-rank matrix bandit problem (OPD-Min). A two-phase algorithm, OPD-Min-ESTR, is proposed that reduces the dimensionality from \(|V|^2\) to \(O(|V|)\) via subspace estimation, achieving substantial improvements over full-dimensional linear bandit baselines on both synthetic and real-world networks.
Online Prediction of Stochastic Sequences with High Probability Regret Bounds: This paper revisits the classical problem of universal prediction of stochastic sequences over a finite horizon \(T\), and establishes, for the first time, vanishing regret bounds that hold with high probability in the form \(O(T^{-1/2}\delta^{-1/2})\). These bounds closely mirror the existing expected regret bound of \(O(T^{-1/2})\), and the paper further proves that the exponent of \(\delta\) cannot be improved without additional assumptions.
Optimistic Task Inference for Behavior Foundation Models: This paper proposes OpTI-BFM — a test-time task inference method for Behavior Foundation Models that requires neither a complete reward function nor an annotated dataset, and recovers oracle performance within approximately 5 episodes of environment interaction. The core insight is to exploit the linear structure of successor features to reduce task inference to a linear bandit problem, employing a UCB strategy for optimistic exploration in task embedding space, with formal regret guarantees.
P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling: This paper proposes P-GenRM, the first personalized generative reward model. Through a three-stage training pipeline—PSI supervised fine-tuning to construct structured evaluation chains, CRE reinforcement learning to enhance reasoning under missing preference signals, and hard-negative curriculum learning to improve robustness—P-GenRM converts mixed preference signals into context-adaptive user personas and scoring rubrics. At inference time, a dual-granularity test-time scaling strategy is introduced: individual-level multi-sample aggregation and prototype-level collaborative filtering that borrows preferences from similar users. P-GenRM surpasses the previous SOTA by 2.31% on PersonalRewardBench, with test-time scaling yielding an additional ~3% gain, while generalizing to unseen users.
ParaS2S: Benchmarking and Aligning Spoken Language Models for Paralinguistic-Aware Speech-to-Speech Interaction: This paper proposes the ParaS2S framework, which comprises ParaS2SBench — a benchmark for evaluating paralinguistic-aware (emotion/sarcasm/age/gender) speech-to-speech interaction — and ParaS2SAlign, a GRPO-based RL alignment framework that enables S2S models to learn style-adaptive response generation with minimal labeled data.
Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments: This paper proposes the Partially Invariant MDP (PI-MDP) framework, which employs a learnable gating function \(\lambda(s,a)\) to pointwise switch between equivariant and standard Bellman updates across the state-action space. The paper theoretically proves that local symmetry breaking propagates through discounted backup and amplifies global value function error by a factor of \(1/(1-\gamma)\), while PI-MDP provably confines the error strictly within the breaking region. The framework is instantiated as PE-DQN and PE-SAC, achieving comprehensive improvements over strictly equivariant and approximately equivariant baselines on Grid-World, MuJoCo locomotion, and robotic manipulation tasks.
PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning: PolicyFlow seamlessly integrates a continuous normalizing flow (CNF) policy into the PPO framework: it approximates the importance ratio via velocity field differences along an interpolated path (avoiding full ODE path backpropagation), and introduces a Brownian motion-inspired implicit entropy regularizer to prevent mode collapse. The method matches or surpasses Gaussian PPO and flow-based baselines (FPO/DPPO) across MultiGoal, PointMaze, IsaacLab, and MuJoCo environments.
Post-training Large Language Models for Diverse High-Quality Responses: This paper proposes DQO (Diversity Quality Optimization), which defines a diversity metric in semantic embedding space via determinantal point processes (DPP), and jointly optimizes it with reward signals to simultaneously improve semantic diversity and response quality during LLM post-training. DQO can be stacked on top of GRPO/PPO.
PreferThinker: Reasoning-based Personalized Image Preference Assessment: This paper proposes PreferThinker, which introduces a universal visual preference profile to bridge across different users and adopts a predict-then-assess CoT reasoning paradigm for interpretable personalized image preference assessment. Combined with cold-start SFT and GRPO reinforcement learning along with a similarity-aware prediction reward, the 7B model outperforms GPT-4o (+5.2%) and Claude 3.7 (+5.1%).
Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning: Inspired by the hippocampus–neocortex interaction in the human brain, this paper proposes FAME, a dual-learner framework for continual reinforcement learning that employs a fast learner for knowledge transfer and a meta learner for knowledge consolidation, achieving efficient continual RL while minimizing catastrophic forgetting in a principled way.
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models: This paper models LLM layer pruning as a cooperative game, employing a lightweight surrogate network to approximate Shapley values that capture inter-layer dependencies, achieving superior deep pruning performance over static heuristic methods.
QuRL: Efficient Reinforcement Learning with Quantized Rollout: This paper proposes QuRL, a method that quantizes the actor model to accelerate the rollout phase in RL training. It introduces Adaptive Clipping Range (ACR) to address training collapse caused by quantization, and Update-Aware Quantization (UAQ) to resolve the scale mismatch between weight updates and quantization error. QuRL achieves a 20%–80% improvement in inference throughput without performance degradation.
REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning: REA-RL is a framework that employs a distilled small reflection model to detect and truncate overthinking tokens online, generating revised reasoning paths, while incorporating a reflection reward to keep the model from degrading into non-reflective vanilla CoT during RL training. On DeepSeek-R1-Distill-Qwen-7B, it achieves a 36% reduction in reasoning token consumption with zero accuracy loss.
Reasoning as Representation: Rethinking Visual Reinforcement Learning in Image Quality Assessment: Through systematic experimentation, this paper reveals the fundamental mechanism underlying the generalization capability of RL-trained reasoning-based IQA models — the reasoning process essentially transforms redundant visual representations into compact, cross-domain aligned textual representations. Building on this insight, the paper proposes the RALI algorithm, which directly aligns image representations to these textual representations via contrastive learning, achieving comparable generalization performance with less than 5% of the parameters and inference time.
Reasoning Boosts Opinion Alignment in LLMs: GRPO-based reinforcement learning is applied to train LLMs to align with individual political opinions via structured reasoning. SFT+GRPO consistently outperforms ICL and ORPO baselines across U.S., German, and Swiss datasets, while systematically revealing a left–right ideological asymmetry and a fundamental difficulty in predicting Neutral stances.
RebuttalAgent: Strategic Persuasion in Academic Rebuttal via Theory of Mind: This paper is the first to introduce Theory of Mind (ToM) into academic rebuttal, proposing a three-stage ToM-Strategy-Response (TSR) framework: first modeling the reviewer's mental state, then formulating a persuasion strategy, and finally generating evidence-grounded responses. Combined with self-reward RL training and a dedicated Rebuttal-RM evaluator, the approach achieves an average improvement of 18.3% over the base model.
References Improve LLM Alignment in Non-Verifiable Domains: This paper proposes RefEval, a reference-guided LLM-as-Judge framework that uses high-quality reference outputs as "soft verifiers," improving LLM-judge accuracy by 6.8%. Building on this, the authors design a two-stage self-improvement pipeline (SFT distillation + reference-guided DPO) that outperforms SFT distillation alone by +19.2/+16.5 on AlpacaEval/Arena-Hard, matching the performance of the fine-tuned reward model ArmoRM — demonstrating that effective LLM alignment in non-verifiable domains is achievable without human preference annotation.
ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation: ReFORM manipulates the source distribution of a behavior cloning (BC) flow policy by learning a reflected flow noise generator. This enforces support constraints constructively, avoiding OOD issues while preserving policy expressiveness, and requires no hyperparameter tuning.
Regret-Guided Search Control for Efficient Learning in AlphaZero: This paper proposes RGSC (Regret-Guided Search Control), a framework that trains a regret network to identify high-regret states and prioritizes restarting self-play from these states, emulating the human learning strategy of repeatedly reviewing mistakes. RGSC outperforms AlphaZero by an average of 77 Elo across 9×9 Go, 10×10 Othello, and 11×11 Hex.
Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models: Layer pruning in LLMs is formulated as a cooperative game in which each layer is a player and model performance is the utility. Because exact Shapley value computation is infeasible (\(2^L\) combinations), a two-stage approximation is proposed: (1) stratified Monte Carlo sampling generates masks and evaluates PPL as supervision signals; (2) a lightweight surrogate network is trained to predict the performance of arbitrary masks. This yields efficient per-layer Shapley value estimates that capture inter-layer dependencies and substantially outperform static heuristic pruning baselines.
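For intuition on why exact Shapley computation does not scale, the sketch below computes exact Shapley values for a three-layer toy game by enumerating every ordering of the players. The utility function is a made-up stand-in for model performance (hypothetical per-layer contributions plus one interaction bonus), and the enumeration is the brute-force baseline, not the paper's surrogate-based estimator.

```python
from itertools import permutations
from math import factorial

# Hypothetical per-layer contributions; layers 0 and 1 also interact.
BASE = {0: 1.0, 1: 2.0, 2: 3.0}

def utility(kept):
    """Toy utility of a set of kept layers (stand-in for model performance)."""
    value = sum(BASE[i] for i in kept)
    if 0 in kept and 1 in kept:
        value += 1.5  # interaction bonus when both layers are kept
    return value

def exact_shapley(players, utility_fn):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        kept = set()
        for p in order:
            before = utility_fn(kept)
            kept.add(p)
            totals[p] += utility_fn(kept) - before
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

shapley = exact_shapley([0, 1, 2], utility)
print(shapley)
```

The interaction bonus is split evenly between layers 0 and 1, which a per-layer heuristic that scores each layer in isolation would miss. The loop visits \(L!\) orderings, so this brute-force form is only practical for a handful of layers.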
ReMix: Reinforcement Routing for Mixtures of LoRAs in LLM Finetuning: ReMix identifies a severe routing weight collapse problem in existing Mixture-of-LoRAs models (even when \(k>1\) LoRAs are activated, the effective LoRA count rapidly degenerates to 1), proposes non-learnable constant routing weights to ensure equal contribution from all activated LoRAs, and trains the router using the RLOO reinforcement learning gradient estimator, significantly outperforming state-of-the-art PEFT methods.
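The RLOO (REINFORCE leave-one-out) estimator mentioned above baselines each sample's reward against the mean reward of the other samples drawn for the same input. A minimal sketch of that advantage computation follows; the reward values are invented, and how ReMix defines the reward for a routing decision is not specified here.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages: r_i minus the mean of the other k-1 rewards."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per input")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # hypothetical rewards for four sampled routings
advantages = rloo_advantages(rewards)
print(advantages)
```

Because each baseline excludes the sample it is applied to, it does not depend on that sample's own reward, and the advantages across the group sum to zero.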
ReMoT: Reinforcement Learning with Motion Contrast Triplets: ReMoT proposes a unified training paradigm that systematically enhances VLM spatiotemporal consistency reasoning through a rule-driven motion contrast triplet dataset (ReMoT-16K) and Group Relative Policy Optimization (GRPO) with composite reward optimization, achieving a 25.1% performance gain on spatiotemporal reasoning benchmarks.
Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent RL: This paper proposes S2Q (Successive Sub-value Q-learning), which explicitly retains suboptimal joint actions by successively learning \(K\) sub-value functions. Combined with a Softmax behavior policy for prioritized sampling among candidates, S2Q addresses the root cause of suboptimal convergence in cooperative MARL value decomposition methods—namely, that policy optima shift dynamically during training.
Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning: This paper theoretically analyzes how inter-policy diversity affects learning efficiency in ensemble policy gradient methods, and proposes Coupled Policy Optimization (CPO), which regulates diversity via KL divergence constraints to achieve efficient and stable exploration in large-scale parallel environments.
Revisiting Matrix Sketching in Linear Bandits: Achieving Sublinear Regret via Dyadic Block Sketching: This paper identifies a fundamental flaw in existing sketch-based linear bandit methods: when the spectrum of the streaming matrix has a heavy tail, these methods degenerate to linear regret. To address this, the paper proposes the Dyadic Block Sketching (DBS) framework, which dynamically doubles the sketch size to control the global approximation error within a user-specified parameter \(\epsilon\). The resulting algorithm guarantees sublinear regret without requiring prior knowledge of the spectral structure of the streaming matrix, and adaptively recovers the computational efficiency of single-scale methods when the spectrum is favorable.
RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning: This paper proposes the RewardMap framework, which addresses the sparse reward problem in fine-grained visual reasoning through difficulty-aware detail reward design and a multi-stage RL curriculum that progresses from simple perception to complex reasoning.
RLP: Reinforcement as a Pretraining Objective: This paper proposes RLP (Reinforcement Learning Pretraining), an information-gain-driven RL pretraining objective that rewards Chain-of-Thought (CoT) reasoning when it improves next-token prediction probability. RLP shifts reinforcement learning from the post-training stage into pretraining, enabling dense reward signals without any verifier.
RM-R1: Reward Modeling as Reasoning: This paper reframes reward modeling as a reasoning task, introducing the RM-R1 family of Reasoning Reward Models (ReasRM). Through reasoning distillation combined with RL training and a Chain-of-Rubrics (CoR) mechanism, RM-R1 outperforms 70B and GPT-4o models by an average of 4.9% across three major reward model benchmarks.
Robust Deep Reinforcement Learning against Adversarial Behavior Manipulation: This paper studies a novel threat in RL—behavior-targeted attacks (where an adversary manipulates observations to steer the victim toward executing a specific target policy)—and proposes BIA, a black-box attack method, along with TDRT, a temporally discounted robust training defense. TDRT achieves robustness against such attacks while outperforming the existing defense SA-PPO on original task performance by 28.2%.
Robust Multi-Objective Controlled Decoding of Large Language Models: This paper proposes RMOD (Robust Multi-Objective Decoding), an inference-time algorithm that dynamically computes worst-case objective weights by solving for the Nash equilibrium of a minimax game, achieving robust multi-objective alignment of LLMs without requiring any prior knowledge of objective weights.
Routing, Cascades, and User Choice for LLMs: This paper models LLM routing as a provider-user Stackelberg game, proves that the optimal routing policy is almost always a static, cascade-free threshold rule, reveals user-provider misalignment when quality/cost rankings are inconsistent, and shows that under low churn penalties providers are incentivized to inflate latency via throttling to reduce cost at the expense of user utility.
RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling: RuleReasoner constructs a diverse rule reasoning dataset, RuleCollection-32K, and proposes a domain-aware dynamic sampling strategy (Dads). Under the RLVR framework, an 8B model trained with this approach outperforms OpenAI-o1 by 4.1% on in-distribution reasoning tasks and by 10.4% on out-of-distribution tasks, while achieving approximately 1.4× training efficiency improvement.
Safe Continuous-time Multi-Agent Reinforcement Learning via Epigraph Form: This paper proposes the first continuous-time multi-agent RL framework that explicitly handles state constraints. By reformulating the discontinuous constrained value function into a continuous representation via the epigraph form, and combining an improved PINN-based actor-critic method, the framework achieves safe and stable continuous-time multi-agent control.
Sample-efficient and Scalable Exploration in Continuous-Time RL: This paper proposes COMBRL, an algorithm that achieves scalable and sample-efficient exploration in continuous-time model-based RL by maximizing a weighted sum of extrinsic reward and epistemic uncertainty, with theoretical guarantees of sublinear regret.
Scalable Exploration for High-Dimensional Continuous Control via Value-Guided Flow: This paper proposes Qflex (Q-guided Flow Exploration), a scalable RL method for exploration in high-dimensional continuous action spaces. It transports actions from a learnable source distribution along a probability flow induced by the Q-function, aligning exploration with task-relevant gradients rather than isotropic noise. Qflex outperforms Gaussian and diffusion-based RL baselines across various high-dimensional benchmarks, and successfully controls a full-body musculoskeletal model with 700 actuators to perform agile and complex motions.
Scalable In-Context Q-Learning: This paper proposes S-ICQL, which introduces dynamic programming (Q-learning) and world models into the supervised ICRL framework. A multi-head Transformer simultaneously predicts the policy and in-context value functions, a pretrained world model constructs lightweight yet accurate prompts, and advantage-weighted regression is used for policy extraction. S-ICQL consistently outperforms all baselines when learning from suboptimal data in both discrete and continuous environments.
Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning: This paper proposes the Self-Harmony framework, in which a single model plays two roles—a Solver that addresses the original problem and a Reframer that paraphrases it—and uses the harmonic mean of answer scores across both perspectives as a pseudo-label selection criterion, replacing conventional majority voting. The approach achieves state-of-the-art performance in 28 out of 30 experimental settings with zero training failures.
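The selection rule can be illustrated with a small sketch that scores each candidate answer by the harmonic mean of its frequency under the original and the reframed question. Using empirical answer frequencies as the scores is an assumption made for illustration, as are the function names and the sample answers.

```python
from collections import Counter

def frequencies(answers):
    """Empirical frequency of each distinct answer."""
    counts = Counter(answers)
    n = len(answers)
    return {a: c / n for a, c in counts.items()}

def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def select_pseudo_label(original_answers, reframed_answers):
    """Pick the answer with the highest harmonic-mean score across both views."""
    f_orig = frequencies(original_answers)
    f_ref = frequencies(reframed_answers)
    candidates = set(f_orig) | set(f_ref)
    scores = {
        c: harmonic_mean(f_orig.get(c, 0.0), f_ref.get(c, 0.0))
        for c in candidates
    }
    return max(scores, key=scores.get), scores

original = ["A", "A", "A", "B", "B"]   # hypothetical Solver answers
reframed = ["B", "B", "B", "B", "C"]   # hypothetical answers on the paraphrase
label, scores = select_pseudo_label(original, reframed)
majority = Counter(original).most_common(1)[0][0]
print("harmonic-mean label:", label, "| majority vote on original:", majority)
```

In this example, majority voting on the original question alone would pick "A", while the harmonic mean favors "B", the only answer supported under both phrasings.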
Self-Improving Skill Learning for Robust Skill-based Meta-Reinforcement Learning: This paper proposes SISL (Self-Improving Skill Learning), which decouples the high-level exploitation policy from a dedicated skill improvement policy, and incorporates a maximum return relabeling mechanism for skill prioritization. SISL achieves robust skill learning under noisy offline demonstration data and substantially improves the performance of skill-based meta-reinforcement learning on long-horizon tasks.
Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning: This paper proposes Shop-R1, a framework that leverages hierarchical reward design and difficulty-aware reward scaling within a reinforcement learning paradigm to substantially improve LLMs' ability to simulate realistic human online shopping behavior, achieving over 65% improvement in exact action match compared to SFT baselines.
Single Index Bandits: Generalized Linear Contextual Bandits with Unknown Reward Functions: This paper introduces the Single Index Bandit (SIB) problem — extending generalized linear bandits to the setting where the reward function is unknown — and proposes a family of efficient algorithms (STOR/ESTOR/GSTOR) based on Stein's method, achieving near-optimal regret \(\tilde{O}(\sqrt{T})\) under monotone increasing reward functions.
Solving Football by Exploiting Equilibrium Structure of 2p0s Differential Games with One-Sided Information: This paper proves the atomic structure of Nash equilibrium strategies in two-player zero-sum differential games with one-sided information: the equilibrium strategy of the informed player P1 concentrates on at most \(I\) action prototypes (where \(I\) = number of game types), reducing game tree complexity from \(U^{2K}\) to \(I^K\). This enables an M1 MacBook to solve 11v11 American football with continuous action spaces (traditional complexity \(10^{440}\)) in 30 minutes.
Solving Parameter-Robust Avoid Problems with Unknown Feasibility using Reinforcement Learning: This paper proposes Feasibility-Guided Exploration (FGE), which simultaneously identifies the feasible parameter subset and learns a safe policy over that subset, addressing parameter-robust avoid problems with unknown feasibility. FGE covers more than 50% additional safe parameters compared to the best existing methods on MuJoCo tasks.
Spectral Bellman Method: Unifying Representation and Exploration in RL: This paper proposes the Spectral Bellman Method (SBM), which derives a spectral relationship between the Bellman operator and feature covariance structure from the zero Intrinsic Bellman Error (IBE) condition, leading to a novel representation learning objective that naturally unifies representation learning and Thompson Sampling–based exploration.
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models: This paper proposes the SPELL framework, in which a single LLM simultaneously assumes three roles—question generator, responder, and verifier—engaging in self-play reinforcement learning without human annotation to continuously improve long-context reasoning, achieving consistent performance gains across 6 long-context benchmarks.
SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning: This paper proposes SPIRAL, a framework that trains LLMs via self-play in multi-turn zero-sum games. Through Role-conditioned Advantage Estimation (RAE) to stabilize training, SPIRAL improves reasoning performance by up to 10% without domain-specific data, and reveals that different games cultivate complementary cognitive abilities.
Spotlight on Token Perception for Multimodal Reinforcement Learning: This paper proposes VPPO (Visually-Perceptive Policy Optimization), which quantifies the visual dependency of each token and refines learning signals at both the trajectory level and the token level, significantly enhancing the multimodal reasoning capabilities of large vision-language models.
Stackelberg Coupling of Online Representation Learning and Reinforcement Learning: This paper proposes SCORER, a framework that models representation learning and value function learning in Deep Q-Learning as a Stackelberg game. Through two-timescale updates—where the Q-network acts as the slow-updating leader and the encoder as the fast-updating follower—SCORER achieves stable co-adaptation without modifying the network architecture.
Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty: This paper proposes ARLCP (Adaptive Reflection and Length Coordinated Penalty), an adaptive reinforcement learning method that dynamically adjusts the weights of reflection and length penalties according to problem complexity, substantially reducing token consumption while maintaining or improving accuracy.
Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning: This paper proposes SSE (Strict Subgoal Execution), a framework that strictly distinguishes between successful and failed subgoal reaching via Frontier Experience Replay (FER), combined with a decoupled exploration policy and failure-aware path optimization. By enforcing subgoal completion within each high-level step, SSE substantially reduces the number of high-level decisions and improves success rates on long-horizon tasks.
SUSD: Structured Unsupervised Skill Discovery through State Factorization: This paper proposes SUSD (Structured Unsupervised Skill Discovery), which factorizes the state space into independent factors and assigns dedicated skill variables to each factor. Combined with a curiosity-driven factor-weighting mechanism, SUSD discovers diverse skills that cover all controllable factors in complex multi-object and multi-agent environments.
\(\textbf{Re}^{2}\): Unlocking LLM Reasoning via Reinforcement Learning with Re-solving: This paper proposes Re², a pure reinforcement learning method that trains LLMs to actively abandon unproductive reasoning chains and restart the solving process during inference. The approach amplifies the rare redo behavior from ~0.5% to over 30%, achieving significant improvements over standard RLVR methods under the same training compute budget.
The Sample Complexity of Online Reinforcement Learning: A Multi-Model Perspective: This paper proposes an online reinforcement learning algorithm for nonlinear dynamical systems with continuous state-action spaces. By combining multi-model posterior sampling with certainty-equivalence control, the algorithm enables online learning of unknown systems and provides non-asymptotic policy regret guarantees that scale from finite model sets to parametric model families.
Thermodynamics of Reinforcement Learning Curricula: This paper formalizes curriculum learning in RL as a geodesic optimization problem over task space using a framework of excess work minimization drawn from non-equilibrium thermodynamics, and derives the MEW temperature annealing algorithm based on a friction tensor, outperforming standard SAC temperature scheduling on the MuJoCo Humanoid task.
Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization: This paper proposes Latent Thought Policy Optimization (LTPO), a test-time reasoning enhancement framework that requires no model parameter updates. By treating intermediate latent "thought" vectors as dynamically optimizable variables, LTPO leverages online policy gradient methods and intrinsic confidence reward signals to enhance the reasoning capability of frozen LLMs.
Toward a Dynamic Stackelberg Game-Theoretic Framework for Agent-Based Conversational AI Defense Against LLM Jailbreaking: This paper formalizes LLM jailbreaking attack-defense interactions as a dynamic Stackelberg extensive-form game, integrates Rapidly-exploring Random Tree (RRT) search over the prompt space, and proposes the Purple Agent defense architecture that achieves proactive defense through "red-team thinking, blue-team action."
Towards Bridging the Gap between Large-Scale Pretraining and Efficient Finetuning for Humanoid Control: LIFT proposes a three-stage pretraining-finetuning framework: (i) large-scale parallel SAC pretraining for zero-shot deployment; (ii) offline pretraining of a physics-prior world model based on Lagrangian dynamics; (iii) efficient finetuning via deterministic action execution in the environment combined with stochastic exploration within the world model. The full sim-to-real pipeline is validated on Booster T1 and Unitree G1 humanoid robots.
Towards Strategic Persuasion with Language Models: Grounded in the Bayesian Persuasion framework, this paper proposes a systematic methodology for evaluating and training the strategic persuasion capabilities of LLMs. It finds that frontier models already exhibit significant strategic persuasion ability, and that even small LLMs can substantially improve their persuasive performance through reinforcement learning.
TPRU: Advancing Temporal and Procedural Understanding in Large Multimodal Models: TPRU constructs a large-scale multi-image temporal understanding dataset (24,750 QA pairs, 126,000 images) spanning 3 complementary task types (temporal ordering, next-frame prediction, previous-frame review) across 4 embodied scenarios including robotic manipulation and GUI navigation, and demonstrates that RL fine-tuning enables a 7B model to surpass GPT-4o on temporal understanding benchmarks.
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design: TRACED improves regret approximation in Unsupervised Environment Design (UED) by augmenting the conventional PVL with an Approximate Transition Prediction Loss (ATPL) to capture dynamics model mismatch, and introduces a Co-Learnability measure to quantify inter-task transfer benefits. On MiniGrid and BipedalWalker, TRACED surpasses all baselines' 20k-update performance using only 10k updates.
Transitive RL: Value Learning via Divide and Conquer: This paper proposes Transitive Reinforcement Learning (TRL), a novel value function learning algorithm based on the divide-and-conquer paradigm. By exploiting the triangle inequality structure inherent in goal-conditioned RL, TRL recursively decomposes value function updates into subproblems, achieving superior performance over TD learning and Monte Carlo methods on long-horizon tasks.
Trinity: An Evolved LLM Coordinator: Trinity introduces a lightweight coordinator (0.6B SLM + ~10K trainable parameters in a linear head) optimized via sep-CMA-ES. In multi-turn dialogues, the coordinator routes queries to different LLMs and assigns each one of three roles (Thinker, Worker, or Verifier). It achieves a state-of-the-art 86.2% pass@1 on LiveCodeBench and consistently outperforms all single-model and multi-agent baselines across 4 in-distribution and 4 out-of-distribution tasks.
TROLL: Trust Regions improve Reinforcement Learning for Large Language Models: This paper proposes TROLL (Trust Region Optimization for Large Language models), which replaces the clipping mechanism in PPO with a differentiable discrete trust-region projection, enabling token-level policy updates under principled KL constraints. TROLL consistently outperforms PPO-clip on mathematical reasoning and code generation tasks.
UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings: This paper proposes UME-R1, the first framework to explore a reasoning-driven generative multimodal embedding paradigm. Through a two-stage training pipeline (cold-start SFT followed by reinforcement learning), the embedding model learns to reason before generating representations, achieving significant improvements over traditional discriminative embedding models across 78 tasks on the MMEB-V2 benchmark.
Understanding and Improving Hyperbolic Deep Reinforcement Learning: Through closed-form gradient analysis, this paper identifies the root causes of instability in hyperbolic deep RL—namely, conformal factor explosion in the Poincaré Ball and PPO trust-region breakdown induced by large-norm embeddings. It proposes Hyper++, a four-component solution comprising RMSNorm, learnable scaling, HL-Gauss categorical value loss, and the Hyperboloid model, achieving comprehensive improvements over prior baselines on ProcGen (16 environments) and Atari-5.
Unsupervised Learning of Efficient Exploration Pre-training Adaptive Policies vi: This paper proposes ULEE, an unsupervised meta-learning method that trains adaptive policies via adversarially self-generated goal curricula, achieving efficient exploration and few-shot adaptation on the XLand-MiniGrid benchmark.
Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning: This paper constructs HitEmotion, a hierarchical multimodal emotion understanding benchmark grounded in Theory of Mind (ToM), and proposes the TMPO framework, which leverages intermediate mental states as process-level supervision to enhance the emotion reasoning capabilities of MLLMs.
Value Flows: Value Flows is the first work to introduce flow matching into distributional RL: it learns a vector field such that the induced probability density path automatically satisfies the distributional Bellman equation. Variance of the return distribution is efficiently estimated via a flow derivative ODE, enabling confidence-weighted prioritized learning. The method achieves an average 1.3× improvement in success rate across 62 OGBench tasks, and estimates return distributions 3× more accurately than C51/CODAC.
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models: This work introduces VerifyBench and VerifyBench-Hard, two evaluation benchmarks targeting reference-based reward systems widely used in training Large Reasoning Models (LRMs). Through rigorous human annotation, the benchmarks assess the accuracy of various verification systems and reveal that even the strongest models achieve only approximately 88% accuracy on hard samples, exposing substantial room for improvement in current verification systems.
Virne: A Comprehensive Benchmark for RL-based Network Resource Allocation in NFV: This paper proposes Virne, a comprehensive benchmark framework for Network Function Virtualization Resource Allocation (NFV-RA) that integrates 30+ algorithms and a gym-style environment to support systematic evaluation across cloud, edge, 5G, and other scenarios.
Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity: This paper proposes the DMVR framework and the α-DPG algorithm. By explicitly defining a target distribution that "filters out incorrect answers" and approximating it via the α-divergence family, the work unifies RLVR (Reverse KL) and rejection sampling fine-tuning (Forward KL), achieving Pareto-optimal performance on the accuracy–coverage frontier for Lean theorem proving.
When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift: This paper investigates the robustness of PPO under temporally persistent sensor failures, proposes integrating sequence models (Transformer and SSMs) into PPO, derives high-probability upper bounds on infinite-horizon reward degradation under stochastic sensor failures, and demonstrates through MuJoCo experiments that Transformer-PPO significantly outperforms MLP, RNN, and SSM baselines under severe sensor dropout.
WIMLE: Uncertainty-Aware World Models with IMLE for Sample-Efficient Continuous Control: WIMLE extends Implicit Maximum Likelihood Estimation (IMLE) to model-based RL, learning stochastic world models capable of capturing multimodal transition dynamics. Predictive uncertainty is estimated via ensemble and latent sampling, and is used to weight the RL objective on synthetic data. Across 40 continuous control tasks, WIMLE achieves superior sample efficiency and asymptotic performance compared to strong model-free and model-based baselines.