🎮 Reinforcement Learning
🧪 ICML2025 · 69 paper notes
📌 Same area in other venues: 📷 CVPR2026 (25) · 🔬 ICLR2026 (400) · 💬 ACL2026 (46) · 🧪 ICML2026 (110) · 🤖 AAAI2026 (58) · 🧠 NeurIPS2025 (143)
🔥 Top topics: Reinforcement Learning ×21 · Agents ×7 · Adversarial Robustness ×6 · Reasoning ×4 · Few-/Zero-Shot Learning ×2
- A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization
-
From the unified perspective of "interacting entities", this paper proves that a single-layer linear self-attention can efficiently represent, learn, and generalize pairwise interaction functions with \(\Theta(|\mathcal{S}|^2)\) parameters (whereas fully connected networks require \(\Omega(L^2|\mathcal{S}|^2)\)). Based on this theoretical insight, two new modules, HyperFeatureAttention (feature-level interaction coupling) and HyperAttention (higher-order multi-entity interactions), are proposed, which reduce perplexity in language modeling.
- Action-Dependent Optimality-Preserving Reward Shaping (ADOPS)
-
The ADOPS method is proposed. By querying the extrinsic/intrinsic value function estimates from the critic network, it adjusts rewards only when the intrinsic reward would change the preference for the optimal action. This achieves action-dependent optimality-preserving reward shaping, overcoming the limitation of PBRS which can only handle action-independent forms, and outperforms all previous optimality-preserving methods and the RND baseline on Montezuma's Revenge.
- Actor-Critics Can Achieve Optimal Sample Efficiency
-
This paper is the first to prove that Actor-Critic algorithms can achieve an optimal sample complexity of \(O(1/\epsilon^2)\) under general function approximation and strategic exploration. This is achieved by integrating optimistic exploration, off-policy critic estimation, and rare-switching policy resets, and the results are further extended to the hybrid RL setting.
- Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets
-
Reveals a hidden flaw in the cooperative rationalization framework (RNP)—even on clean datasets, the generator's sampling bias introduces spurious correlations between rationales and labels. An adversarial detection and instruction intervention method is proposed, significantly outperforming existing methods on text and graph classification.
- Automatic Reward Shaping from Confounded Offline Data
-
Proposes the first theoretically guaranteed data-driven method to automatically learn potential-based reward shaping (PBRS) functions from offline data that contains unobserved confounders. The method uses the causal Bellman optimality equation to upper bound the optimal state value as the potential function, and proves that the resulting Q-UCB Shaping algorithm enjoys a superior gap-dependent regret bound compared to vanilla Q-UCB on pseudo-suboptimal state-action pairs.
- BEAVER: Building Environments with Assessable Variation for Evaluating Multi-Objective Reinforcement Learning
-
Proposes the BEAVER benchmark, the first multi-objective contextual reinforcement learning evaluation framework for building energy management. By parameterizing thermal dynamics and climate zones, it constructs controllable environmental variations to systematically evaluate the cross-environment generalization capabilities of existing MORL algorithms.
- Benchmarking Quantum Reinforcement Learning
-
Proposes a rigorous benchmarking methodology for quantum reinforcement learning (QRL)—introducing a statistical estimator based on sample complexity and the concept of "surpassing" defined by statistical significance. Conducts the largest-scale (100 seeds) comparison of QRL vs. classical RL to date on a newly designed 6G beam management environment, revealing that prior claims regarding QRL superiority need to be treated with greater caution.
- Beyond The Rainbow: High Performance Deep Reinforcement Learning on a Desktop PC
-
BTR (Beyond The Rainbow) is proposed—integrating 6 RL improvements into Rainbow DQN to train on Atari-60 to an IQM of 7.4 (compared to 1.9 for Rainbow) within 12 hours on a single desktop PC, and successfully training agents to play 3D games like Super Mario Galaxy, Mario Kart, and Mortal Kombat for the first time.
- BRITE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning
-
BRITE is proposed to iteratively collect and reinforce the intermediate thinking processes of LLMs via bootstrapping, combining process-level reward models and PPO training to continuously enhance LLM performance on tasks such as mathematical reasoning.
- Conceptual Belief-Informed Reinforcement Learning
-
Proposes HI-RL (Human Intelligence-RL), which integrates conceptual abstraction and probabilistic prior belief mechanisms from cognitive science into RL. It extracts high-level concepts from experience and constructs concept-associated adaptive priors to guide value function/policy updates, consistently improving the sample efficiency of DQN/PPO/SAC/TD3 as an algorithm-agnostic plug-in.
- Continual Reinforcement Learning by Planning with Online World Models
-
This paper proposes the FTL Online Agent (OA), which achieves continual reinforcement learning through an online-learned Follow-The-Leader shallow world model combined with Model Predictive Control (MPC) planning. This world model is immune to catastrophic forgetting by construction, provides a theoretical regret bound guarantee of \(\mathcal{O}(\sqrt{K^2 D \log(T)})\), and comprehensively outperforms deep world model-based methods on a specially designed Continual Bench.
- Controlling Underestimation Bias in Constrained Reinforcement Learning for Safe Exploration
-
Proposes MICE (Memory-driven Intrinsic Cost Estimation)—a method that stores historical high-cost states through a flashbulb memory mechanism and constructs intrinsic cost signals to correct the underestimation bias of the cost value function, significantly reducing constraint violations during the training of constrained RL.
- Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making
-
This paper proposes a bi-level causal decomposition framework that systematically decomposes the Total Counterfactual Effect (TCFE) of an action in multi-agent sequential decision-making into the "effect propagated through agent behavior" (tot-ASE) and the "effect propagated through state transitions" (r-SSE), and further attributes them to individual agents and state variables using Shapley values and Intrinsic Causal Contribution (ICC), respectively.
- Decoding Rewards in Competitive Games: Inverse Game Theory with Entropy Regularization
-
A unified framework for inverse problems in zero-sum games based on entropy regularization is proposed. Under linear assumptions, the identifiability conditions of reward functions are established using Quantal Response Equilibrium (QRE). An algorithm for constructing confidence sets is provided to recover reward functions from observed actions, with a guaranteed convergence rate of \(\mathcal{O}(T^{-1/2})\).
- Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation
-
This work theoretically explains the paradox that, in OPE, importance sampling with an estimated history-dependent behavior policy can outperform importance sampling with the true behavior policy: estimating the behavior policy implicitly projects the IS estimator onto a more constrained space, reducing asymptotic variance at the cost of increasing finite-sample bias.
- Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
-
GLIDER introduces a parameter-efficient hierarchical structure where a high-level policy learns abstract step-by-step plans to guide a low-level controller. By decomposing complex long-horizon decision-making into coherent Chain-of-Thought (CoT) reasoning subtasks via offline hierarchical RL, it achieves consistent performance improvements and stronger generalization capabilities on ScienceWorld and ALFWorld.
- Diving into Self-Evolving Training for Multimodal Reasoning
-
This paper revisits self-evolving training in multimodal reasoning from a reinforcement learning perspective, systematically analyzing three key factors: training methods, reward models, and prompt variations. It proposes an adaptive temperature adjustment mechanism based on Reward-Pass@K to alleviate training saturation, culminating in the M-STaR framework, which achieves consistent improvements across multiple benchmarks.
- Embedding Safety into RL: A New Take on Trust Region Methods
-
The C-TRPO algorithm is proposed, which modifies the geometry of the policy space (by embedding a constraint-aware barrier term into the KL divergence) so that the trust region naturally contains only safe policies. This guarantees constraint satisfaction throughout the entire training process while maintaining return performance comparable to SOTA.
- Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration
-
Proposes the SMPE² algorithm, which learns meaningful state belief representations through variational inference and integrates adversarial intrinsic exploration. It significantly enhances coordination in partially observable cooperative multi-agent environments, outperforming SOTA on three benchmarks: MPE, LBF, and RWARE.
- Enhancing Decision-Making of Large Language Models via Actor-Critic
-
This work proposes the LAC (LLM-based Actor-Critic) framework, which constructs a Q-function (Critic) using the ratio of positive/negative outcome probabilities from token logits and achieves gradient-free policy optimization (Actor) via a closed-form solution with a KL constraint. It outperforms GPT-4 + ReAct on three benchmarks (ALFWorld, BabyAI-Text, WebShop) using 7B/8B models.
- Ergodic Generative Flows
-
This paper proposes Ergodic Generative Flows (EGFs), which construct generative flows via a finite set of global diffeomorphisms. By leveraging ergodicity, EGFs guarantee universality. A novel KL-weakFM loss is designed to enable imitation learning without requiring an independent reward model. EGFs outperform baselines on NASA Earth science datasets with a model 30 times smaller.
- Exploring Large Action Sets with Hyperspherical Embeddings using von Mises-Fisher Sampling
-
This paper proposes vMF-exp, which achieves scalable exploration over large action sets (on the order of millions of actions) by sampling von Mises-Fisher distributed vectors on the hypersphere and then performing nearest-neighbor retrieval. It is theoretically proven to be asymptotically equivalent to Boltzmann exploration under the uniform distribution assumption, and has been successfully deployed in the Deezer music recommendation system.
- Extreme Value Policy Optimization for Safe Reinforcement Learning
-
The EVO algorithm is proposed to introduce Extreme Value Theory (EVT) into constrained reinforcement learning. It models extreme samples in the tail of the cost distribution using the Generalized Pareto Distribution (GPD) and designs extreme quantile constraints along with an extreme prioritization replay mechanism, achieving zero constraint violations during training while maintaining competitive policy performance.
- Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments
-
Proposes PDTS (Posterior and Diversity Synergized Task Sampling), modeling robust active task sampling as an infinite-armed bandit problem. By replacing UCB with posterior sampling and introducing diversity regularization, this simple modification achieves near-worst-case robust adaptation performance in Domain Randomization and Meta-RL.
- Graph-Supported Dynamic Algorithm Configuration for Multi-Objective Combinatorial Optimization
-
This paper proposes GS-MODAC, which leverages GNNs to map solutions in the objective space into a graph structure for learning state representations. Combined with PPO, it dynamically configures the parameters of Multi-Objective Evolutionary Algorithms (MOEAs). It outperforms static and existing DRL methods on scheduling and routing NP-hard combinatorial optimization problems, demonstrating generalization capability across problem scales and numbers of objectives.
- Heterogeneous Data Game: Characterizing the Model Competition Across Multiple Data Sources
-
This paper proposes the Heterogeneous Data Game (HD-Game) framework, applying game theory to analyze the competitive behavior of multiple ML model providers over heterogeneous data sources. It uncovers three pure strategy Nash equilibrium (PNE) patterns—non-existence, homogenization, and heterogenization—and provides sufficient/necessary conditions for the existence of each type.
- Hierarchical Reinforcement Learning with Targeted Causal Interventions
-
This paper proposes the HRC framework, which models the relationships between subgoals in hierarchical reinforcement learning as a causal graph. It learns the subgoal structure using a causal discovery algorithm and performs targeted interventions based on causal effect prioritization, significantly reducing the training cost for long-horizon sparse reward tasks.
- KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies
-
This paper proposes KEA, which actively coordinates different exploration strategies by introducing a dynamic switching mechanism between a standard agent and a novelty-augmented agent. This resolves the issues of redundant sampling and inefficient exploration caused by policy interaction when combining SAC with novelty-based exploration.
- Learning Mean Field Control on Sparse Graphs
-
The paper proposes the Local Weak Mean Field Control (LWMFC) framework, which leverages local weak convergence theory to extend mean field control to extremely sparse graphs with a power-law exponent of \(\gamma > 2\). Combined with a two-systems approximation and scalable RL algorithms, this method significantly outperforms Lp-graphon and graphex-based methods on both synthetic and real-world networks.
- Learning Progress Driven Multi-Agent Curriculum
-
SPMARL is proposed to drive adaptive curriculum distributions over the number of agents using a TD-error-based learning progress (instead of returns), addressing the issues of high variance in return estimation and credit assignment difficulty in multi-agent sparse reward tasks.
- Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals
-
This paper is the first to study repeated principal-agent problems with adversarial agent arrivals. It provides tight upper and lower regret bounds under both greedy and smooth response models, with the core idea of reducing the incentive design problem to adversarial linear bandits.
- Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL
-
Proposes Selective State-Adaptive Regularization (SSAR), which dynamically generates individual regularization coefficients for each state using a neural network and enforces constraints strictly on high-quality actions. This framework unifies the two major offline RL paradigms, CQL (value regularization) and TD3+BC (policy constraint), and achieves substantial performance gains over baselines in both the offline and offline-to-online (O2O) settings on D4RL.
- Learning Utilities from Demonstrations in Markov Decision Processes
-
This paper introduces the Utility Learning (UL) problem to capture agents' risk attitudes by inferring their utility functions from demonstrations, proposes two provably efficient algorithms, and analyzes their sample complexity and identifiability.
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
-
This paper proposes SUPE, a method that uses unlabeled offline trajectory data "twice"—both for VAE skill pre-training and as high-level off-policy data via UCB pseudo-labels to accelerate online exploration, comprehensively outperforming prior methods on 42 sparse-reward tasks.
- LineFlow: A Framework to Learn Active Control of Production Lines
-
Proposes LineFlow, an extensible, open-source Python framework designed to simulate production lines of arbitrary complexity and train RL agents for active production line control (adaptive routing, worker reallocation, dispatching, etc.), while providing mathematical optimal solutions for several sub-problems as benchmarks.
- Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning
-
A novel non-linear estimator based on the log-sum-exponential (LSE) operator is proposed for off-policy evaluation and learning, which significantly reduces variance and provides theoretical guarantees under heavy-tailed rewards and noisy propensity score scenarios.
- Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer
-
This paper proposes the M3DT framework, which integrates Mixture-of-Experts (MoE) into the Decision Transformer to achieve parameter decoupling. By grouping tasks, each expert only learns task-specific knowledge for a small subset of tasks. Coupled with a three-stage training mechanism (backbone \(\rightarrow\) experts \(\rightarrow\) router) to prevent gradient conflict, increasing the number of experts simultaneously scales up parameters and reduces task load, successfully scaling offline multi-task RL to 160 simulated control tasks.
- Meta-Black-Box-Optimization through Offline Q-function Learning (Q-Mamba)
-
This paper proposes Q-Mamba, the first offline MetaBBO framework. By integrating Q-function decomposition, conservative Q-learning, and the Mamba architecture, it achieves comparable or superior BBO algorithm configuration performance with less than half the training budget of online methods.
- Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn
-
This work establishes a causal relationship between plasticity loss and churn (out-of-batch output drift) through the NTK matrix, and proposes the C-CHAIN method to continuously suppress churn during continual RL training. This mitigates plasticity loss and outperforms existing baselines across 24 continual RL environments.
- Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
-
This work discovers that simple one-shot random pruning can unlock the scaling potential of deep RL—sparse networks achieve higher parameter efficiency, stronger plasticity preservation, and less gradient interference than dense networks equipped with SOTA architectures.
- Non-stationary Online Learning for Curved Losses: Improved Dynamic Regret via Mixability
-
By replacing traditional KKT analysis with the concept of mixability, this paper proposes a continuous-space online learning framework based on exponential weights and fixed-share updates, significantly improving the dependence of dynamic regret on dimension \(d\) from \(O(d^{10/3})\) to \(O(d)\) for curved loss functions (such as squared/logistic loss).
- On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning
-
This paper proposes the OptFPRL algorithm, which introduces a history gradient pruning mechanism into the Follow the Regularized Leader (FTRL) framework. This establishes the first data-dependent dynamic regret guarantee for FTRL on compact sets, where the dynamic regret is fully controlled by the prediction error and can reach zero regret under perfect predictions.
- Online Pre-Training for Offline-to-Online Reinforcement Learning
-
This work proposes the OPT method, which introduces an "online pre-training" phase between offline pre-training and online fine-tuning. By introducing an independent value function trained with a meta-adaptation objective, OPT addresses the performance degradation in online fine-tuning caused by inaccurate value estimation of offline pre-trained agents, achieving an average improvement of around 30% on the D4RL benchmark.
- Optimal and Practical Batched Linear Bandit Algorithm
-
By deeply integrating the arm elimination strategy with regularized G-optimal design, BLAE is the first to simultaneously achieve minimax optimal regret (up to a \(\log T\) factor) under both large-\(K\) and small-\(K\) regimes in the batched linear bandit problem, while maintaining the minimal batch complexity of \(\mathcal{O}(\log\log T)\) and outstanding practical performance.
- Optimizing Language Models for Inference Time Objectives using Reinforcement Learning
-
This paper proposes explicitly optimizing inference-time k-sample objectives (pass@k / majority voting) during the RL training phase. By constructing unbiased, low-variance gradient estimators using a leave-one-out control variate, the approach significantly improves inference-time performance on MATH and CodeContests.
- Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer RL
-
This paper proposes a transfer RL framework based on the pessimism principle. By constructing a conservative lower bound of target domain performance as an optimization surrogate using robust MDPs, the authors design two surrogates, Averaged Operator and Minimal Pessimism, along with distributed algorithms, to ensure safe transfer and avoid negative transfer.
- PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable RL
-
This paper proposes the ACPOMDPs theoretical framework and constructs PIGDreamer, which leverages privileged information (e.g., ground-truth states, sensory data) during the training phase through representation alignment, a privileged predictor, and an asymmetric critic to enhance world model-based safe RL. It achieves a 136% performance improvement with only 28% additional training time in partially observable environments.
- Position: Lifetime Tuning is Incompatible with Continual Reinforcement Learning
-
This position paper identifies a critical methodological flaw in continual reinforcement learning (RL) research: "lifetime tuning" (hyperparameter tuning over the entire agent lifetime) masks the true continual learning capability of algorithms. It proposes k-percent tuning as a more reasonable alternative for evaluation.
- Preference Optimization for Combinatorial Optimization Problems
-
Introduces the concept of preference optimization from RLHF into Combinatorial Optimization Problems (COPs), transforming quantitative reward signals into qualitative preference signals. Combined with an entropy-regularized objective and local search fine-tuning, the proposed approach achieves a 1.5x-2.5x convergence acceleration and superior solution quality across standard benchmarks such as TSP, CVRP, and FFSP.
- Principal-Agent Bandit Games with Self-Interested and Exploratory Learning Agents
-
This paper investigates repeated principal-agent bandit games where agents make decisions based on empirical means (instead of known true means) and potentially explore randomly. It designs incentive algorithms for the principal that achieve regret bounds of \(\tilde{O}(\sqrt{T})\) or \(\tilde{O}(T^{2/3})\), significantly improving upon the prior \(\tilde{O}(T^{11/12})\) results.
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
-
This paper proposes the ReVISE framework, which introduces a special token `[refine]` and a two-stage curriculum learning scheme (first learning self-verification, then learning self-correction). This enables LLMs to introspectively verify and correct their own reasoning trajectories at test-time without requiring external verifiers or complex RL training.
- Reward-free World Models for Online Imitation Learning
-
Proposes IQ-MPC, a reward-free world model online imitation learning method that jointly learns the dynamics model and Q-function in latent space via inverse soft-Q learning, achieving stable expert-level imitation in high-dimensional observation and complex dynamics tasks using MPPI planning.
- Robust Noise Attenuation via Adaptive Pooling of Transformer Outputs
-
This paper formalizes the pooling operations of Transformer outputs as a vector quantization problem, demonstrates that AvgPool and MaxPool suffer from performance collapse when the signal-to-noise ratio (SNR) varies, and proposes an adaptive pooling method based on cross-attention (AdaPool). AdaPool is theoretically shown to approximate the signal-optimal quantizer under any SNR and exhibits superior robustness across RL, relational reasoning, and vision tasks.
- Robust Offline Reinforcement Learning with Linearly Structured f-Divergence Regularization
-
Proposes the d-rectangular linear RRMDP (d-RRMDP) framework, incorporating latent linear structures into both the transition kernel and the f-divergence regularization. It designs the R2PVI algorithm to learn robust policies from offline data, proves an instance-dependent suboptimality upper bound, and validates its near-optimality through an information-theoretic lower bound.
- Safety Certificate against Latent Variables with Partially Unidentifiable Dynamics
-
This paper proposes a safety certificate design method based on invariance conditions in the probability space. It utilizes causal reinforcement learning to learn marginalized Q-functions from offline data with latent variables. This ensures long-term safety even when offline and online statistical distributions are inconsistent, and rigorously proves the persistent feasibility of safe actions.
- Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning
-
This paper proposes Dynamic Transition VIN (DT-VIN), which enhances the representation capability of latent MDPs by introducing dynamic transition kernels and designs an adaptive highway loss to alleviate vanishing gradients. This successfully scales VIN to 5000 layers, enabling 1800-step long-term planning in \(100 \times 100\) mazes (compared to the original VIN which only supports 120-step planning in \(25 \times 25\) mazes).
- Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning
-
This paper proposes Sliding Puzzles Gym (SPGym), a benchmark that transforms the classic 8-puzzle into visual RL tasks. By independently adjusting the image pool size, it precisely controls the complexity of visual representation learning. Experiments reveal fundamental memorization limitations of current methods as visual diversity increases.
- Solving Zero-Sum Convex Markov Games
-
This work provides the first theoretical guarantee of global convergence to Nash equilibrium for independent policy gradient methods in two-player zero-sum convex Markov games (cMGs). By utilizing non-convex regularization, the problem is reduced to a nonconvex-pPL min-max optimization, and nested and alternating policy gradient algorithms are designed.
- Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
-
This paper reveals that Dyna-style model-based reinforcement learning algorithms (MBPO, ALM), despite performing exceptionally well on OpenAI Gym, fail catastrophically in the DeepMind Control Suite (DMC). Through a systematic analysis of factors including model error, overestimation bias, and loss of plasticity, it shows that even with a perfect model, MBPO cannot consistently outperform SAC, demonstrating that there is "no free lunch."
- Stochastic Encodings for Active Feature Acquisition
-
This paper proposes SEFA (Stochastic Encodings for Feature Acquisition), an active feature acquisition method based on stochastic latent variable models. By reasoning across multiple unobserved feature realizations in a regularized latent space instead of relying on RL and greedy CMI maximization, it consistently outperforms all baselines on synthetic and real-world datasets (including cancer classification).
- T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
-
T1 scales up open-source LLMs' reasoning performance by utilizing synthetic CoT data for initialization, combined with oversampling and entropy rewards during RL training to encourage exploration. This enables open-source models to exhibit inference-time scaling behavior, outperforming QwQ-32B-Preview on challenging mathematical reasoning benchmarks such as MATH500 and AIME2024.
- Test-Time Adaptation with Binary Feedback
-
This paper proposes BiTTA, a test-time adaptation framework utilizing binary feedback (correct/incorrect). Driven by a reinforcement learning-based dual-path optimization strategy, it achieves a 13.3% accuracy improvement under severe domain shift with minimal annotation cost.
- The Challenge of Teaching Reasoning to LLMs Without RL or Distillation
-
Lightweight fine-tuning of Qwen2.5-32B using only 20 long CoT examples from the reasoning model QwQ-32B-Preview outperforms a 72B math instruction model. However, CoTs generated by non-reasoning models or humans fail to achieve comparable effects, indicating a hard-to-replicate "latent quality" inherent in reasoning CoTs.
- LEAST: The Courage to Stop — Overcoming Sunk Cost Fallacy in Deep RL
-
Proposes Learn to Stop (LEAST), a lightweight adaptive episode early stopping mechanism: it maintains buffers of Q-values and gradient magnitudes for the most recent \(K\) episodes, and constructs a quality threshold \(\epsilon_i\) and a learning potential weight \(\omega_i\) using step-level medians. An episode is terminated and reset when the current Q-value is lower than \(\omega_i \times \epsilon_i\). It yields significant improvements for TD3, SAC, and REDQ across four MuJoCo tasks (improving normalized scores from 0.65 to over 0.70) and accelerates convergence by approximately 30% on the Finger Turn Hard task in DMC visual RL.
- The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks
-
This work systematically investigates the impact of two dimensions of parallel data collection in on-policy RL (the number of parallel environments \(N_{\text{envs}}\) vs. rollout length \(N_{\text{RO}}\)) on PPO performance. It is found that under a fixed data budget, increasing the number of parallel environments is more effective than increasing the rollout length, and larger datasets improve network plasticity and optimization stability.
- The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability
-
In online reinforcement learning scenarios characterized by information asymmetry (where the agent has private types and private actions functioning as confounders) and requiring cross-distribution knowledge transportability, this paper proposes an algorithm named OPME based on nonparametric instrumental variables (NPIV). It proves that OPME achieves an \(\tilde{O}(1/\epsilon^2)\) sample complexity to learn an \(\epsilon\)-optimal policy, matching the corresponding lower bound.
- VinePPO: Refining Credit Assignment in RL Training of LLMs
-
VinePPO exploits the property that language environments can be reset from any intermediate state. It replaces the value network in PPO with Monte Carlo (MC) rollouts for unbiased value estimation. This approach outperforms the peak performance of PPO/GRPO/RLOO on mathematical reasoning tasks with less wall-clock time (up to 3x speedup) and exhibits a stronger generalization gradient.
- Wasserstein Policy Optimization
-
Wasserstein Policy Optimization (WPO) is proposed, which projects the Wasserstein gradient flow from optimal transport theory onto the parameter space. This yields a closed-form update rule that both enjoys the benefits of deterministic policy gradients (DPG) utilizing action-value gradients and supports arbitrary distributions like classic stochastic policy gradients (SPG), without requiring reparameterization tricks.
- Zero-Shot Generalization of Vision-Based RL Without Data Augmentation
-
Proposes ALDA (Associative Latent DisentAnglement), which achieves zero-shot generalization of visual RL in unseen environments through disentangled representation learning and an associative memory mechanism, performing comparably to methods using tens of millions of external data samples without requiring data augmentation.
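As a small illustration of one idea from the list above, the log-sum-exponential (LSE) operator mentioned in the off-policy evaluation entry can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: the functional form `(1/lam) * log(mean(exp(lam * w_i * r_i)))` with `lam < 0`, the function names `ips_estimate` and `lse_estimate`, and the toy `weights` and `rewards` are all assumptions made here for illustration.

```python
import math


def ips_estimate(weights, rewards):
    """Plain importance-weighted mean of the rewards (baseline for comparison)."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def lse_estimate(weights, rewards, lam=-1.0):
    """Log-sum-exponential of the weighted rewards (assumed form, lam < 0).

    Computes (1 / lam) * log(mean(exp(lam * w_i * r_i))).
    """
    if lam >= 0:
        raise ValueError("this sketch assumes lam < 0")
    terms = [lam * w * r for w, r in zip(weights, rewards)]
    shift = max(terms)  # subtract the max before exponentiating for stability
    mean_exp = sum(math.exp(t - shift) for t in terms) / len(terms)
    return (shift + math.log(mean_exp)) / lam


# Hypothetical logged data: the last sample has a very large importance weight.
weights = [1.0, 0.8, 1.2, 50.0]
rewards = [1.0, 0.5, 0.9, 1.0]

ips_value = ips_estimate(weights, rewards)
lse_value = lse_estimate(weights, rewards, lam=-1.0)
```

With a negative `lam`, the exponential shrinks the contribution of very large weighted rewards, so `lse_value` is pulled far less by the outlier than `ips_value` is. As `lam` approaches zero from below, the LSE value approaches the plain weighted mean, which makes `lam` a knob between variance reduction and bias in this sketch.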