
🎮 Reinforcement Learning

🧪 ICML2026 · 110 paper notes

📌 Same area in other venues: 📷 CVPR2026 (25) · 🔬 ICLR2026 (400) · 💬 ACL2026 (46) · 🤖 AAAI2026 (58) · 🧠 NeurIPS2025 (143) · 📹 ICCV2025 (7)

🔥 Top topics: Reinforcement Learning ×47 · Reasoning ×15 · Agents ×11 · LLM ×10 · Diffusion Models ×5

Adaptive Bandit Algorithms for Contextual Matching Markets: This paper studies online matching markets with contexts, treating players' linear preferences for dynamic arm contexts as the bandit learning objective. It proposes BARB for stochastic contexts and AdECO for adversarial contexts, providing adaptive upper bounds for player-optimal stable regret and tight \(\tilde O(T^{2/3})\) theoretical results.
Agent Learning via Early Experience: This paper proposes the "early experience" paradigm, which allows language agents to utilize the future states of their own actions to learn environment dynamics and decision-making reflections without external rewards. This approach consistently outperforms pure imitation learning across 8 agent environments and provides a superior initialization for subsequent GRPO reinforcement learning.
ALSO: Adversarial Online Strategy Optimization for Social Agents: ALSO models dynamic strategy selection in LLM social intelligence simulations as an adversarial online bandit. It utilizes a lightweight reward surrogate model to generalize sparse feedback from dialogue history, improving the overall score on Sotopia-Hard from 3.02 to 3.53, with significant gains in the relationship dimension.
ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization: ASAP identifies that "identifying a set of promising actions" generalizes across distributions more easily than "directly selecting the single optimal action" in neural combinatorial optimization. It utilizes a two-stage proposal-selection strategy and MAML initialization to make neural solvers for 3D-BPP, TSP, and CVRP more stable and faster to adapt when distributions shift.
Beyond Scalar Rewards: Dense Feedback for LLM Policy Synthesis in Sequential Social Dilemmas: This paper proposes an iterative LLM policy synthesis framework where an LLM directly generates Python policy code for multi-agent sequential social dilemmas. Through "feedback engineering," it demonstrates that adding four social metrics—efficiency, equality, sustainability, and peace—as dense feedback alongside scalar rewards breaks the "feedback aliasing" problem, achieving up to a 54% efficiency improvement in the Cleanup game.
Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training: The paper proposes TD-GFN, an offline GFlowNet training framework that eliminates the need for proxy reward models. It extracts edge-level rewards from offline trajectories via inverse reinforcement learning, followed by indirect policy guidance through DAG pruning and prioritized backward sampling. This approach ensures that gradient updates rely exclusively on ground-truth terminal rewards, significantly outperforming existing baselines in tasks such as molecular design and sequence generation.
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games: The PANDA algorithm is proposed to solve bilevel RL problems where the lower level is a regularized zero-sum Markov game. By employing a penalty reformulation based on the Nikaido-Isoda function and utilizing purely first-order policy gradient methods, it achieves an iteration complexity of \(\tilde{O}(\epsilon^{-1})\) and a sample complexity of \(\tilde{O}(\epsilon^{-3})\), matching the best-known rates for single-policy lower-level BRL.
Break the Block: Dynamic-size Reasoning Blocks for Diffusion Large Language Models via Monotonic Entropy Descent with Reinforcement Learning: Addressing the issue where "fixed block sizes" break the logical chain of thought during semi-autoregressive generation in Diffusion Large Language Models (dLLM), this paper proposes b1. It learns a block-end indicator token via RL to generate dynamic-length blocks and employs a "block-level Monotonic Entropy Descent (MED) reward" to drive coherent reasoning. As a plug-and-play reward term integrated into existing dLLM RL frameworks (Diffu-GRPO/GDPO/d1/wd1), it improves wd1 performance on Countdown from 39.45 to 58.98.
CAMEL: Confidence-Gated Reflection for Reward Modeling: This paper observes that the log-probability margin of the verdict token is highly correlated with judgment accuracy. Based on this, it proposes CAMEL—a method that first provides a rapid preference judgment via a single token and triggers reflection generation only when confidence is low. Using counterfactual prefix augmentation in GRPO training to enhance self-correction capabilities, it achieves an average accuracy of 82.9% across three reward model benchmarks with 14B parameters (surpassing the previous best 70B model by 3.2%).
Can Large Language Models Generalize Procedures Across Representations?: This paper finds that procedural knowledge learned by LLMs on symbolic representations (code/graphs) cannot reliably transfer to natural language tasks. It proposes a two-stage RL curriculum strategy—"symbolic then natural language"—enabling a 1.5B Qwen model to approach zero-shot GPT-4o performance on asynchronous planning tasks. From a cognitive science perspective, it demonstrates that successful cross-representation generalization can be interpreted as generative analogy.
Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks: This paper provides the first analytical solution to the classic Mountain Car optimal control problem (unsolved for 36 years), revealing that the optimal policy has a minimalist form (\(\alpha = C \cdot \dot{x}\)). It demonstrates that existing RL agents exhibit surprisingly high regret and proposes a policy parameterization method based on multivariate Chebyshev polynomials, which reduces the number of parameters by 277x while decreasing regret by 4.18x.
COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space: This paper proposes COLLIE, a Guided Skill Discovery (GSD) framework that constructs a "semantically coherent" skill latent space (where close states share similar human desirability) using large-scale unlabeled data. This allows for the training-free propagation of a dense guidance signal \(w(s)\) from sparse human "good/bad" labels, directing unsupervised exploration towards safe and task-relevant regions without the need for additional guidance networks.
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning: This paper proposes CTA (Compositional Transduction with latent Analogies), which decomposes goal-reaching tasks into two independent factors: "task-intrinsic analogies" and "task-extrinsic contexts." By utilizing temporal distance difference fields as analogy representations and combining them with bilinear transduction, the method achieves extrapolation to unseen analogy-context combinations. Its average performance outperforms the strongest baseline by approximately 42% on OGBench manipulation environments.
Convergence of Steepest Descent and Adam under Non-Uniform Smoothness: This paper proposes \((H_0,H_1)\)-NS, a non-uniform smoothness condition that is broader than the \((L_0,L_1)\)-NS of Zhang et al. Under this assumption and the (non-uniform) Łojasiewicz condition, it provides the first unified convergence rates for deterministic diagonal RMSProp / Adam and general Normalized Steepest Descent (Sign GD, Norm.GD, Sign CD-GS). It proves that these methods are strictly faster than GD / AdaGrad / heavy-ball for logistic regression on separable data and softmax policy gradients.
Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning: This paper establishes the stability and almost sure (a.s.) convergence of general two-timescale stochastic approximation (SA) under Markovian noise without relying on any projection operators. Consequently, it provides the first a.s. convergence result for the TDC(\(\lambda\)) algorithm under off-policy linear function approximation.
Counterfactual Transport Flows for Offline Conservative Trajectory Refinement: Given a "low-return" candidate trajectory, this paper avoids re-generating actions from scratch. Instead, it retrieves "better" neighbors in the latent trajectory space as weak supervision, learns an "instance-specific" refinement direction using source-conditioned flow matching, and controls the degree of modification via a refinement intensity parameter \(\alpha\), enabling a continuous trade-off between "preserving original behavior" and "improving returns."
Coupled Variational Reinforcement Learning for Language Model General Reasoning: CoVRL reformulates verifier-free RL, which uses answer probabilities as rewards, as a variational inference problem. It constructs a composite distribution—"prior (question only) + posterior (question with answer)"—and optimizes both simultaneously via hybrid sampling and importance weighting. This approach improves Qwen2.5-7B by an average of 12.4% across 9 general and mathematical reasoning benchmarks compared to the base model, outperforming the strongest verifier-free baseline by an additional 2.3%.
CPMöbius: Iterative Coach–Player Reasoning for Data-Free Reinforcement Learning: The adversarial nature of self-play is replaced with "collaboration": a Coach generates problems, a Player solves them, and the Coach receives a reward based on "Player improvement \(\times\) Player success rate." Without any external training data, Qwen2.5-Math-7B-Instruct achieves an average score increase of +4.9 and an OOD gain of +5.4 across six math benchmarks, outperforming existing unsupervised methods like RENT and R-Zero.
CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning: Addressing the "dual-lag → delayed constraint correction → oscillation near boundaries" issue in primal-dual methods for Safe RL, CSPO incorporates the "shortest signed distance to the safety boundary" as a constraint-sensitive correction term into the policy update. It adaptively adjusts the correction intensity based on the constraint gradient norm, enabling faster and more stable returns to the feasible region without altering the KKT solution of the original problem.
d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation: This paper proposes the d2 reinforcement learning framework for masked diffusion language models (masked DLM). The core contribution is the introduction of two "trajectory likelihood estimators": d2-AnyOrder, which provides exact single-forward estimates for any-order models, and d2-StepMerge, which provides adjustable-precision approximations for standard MDMs. This framework enables the correct implementation of GRPO, allowing LLaDA-8B-Instruct to achieve 91.9% / 56.6% / 85.0% / 41.6% on Sudoku/Countdown/GSM8K/MATH500 respectively, significantly outperforming diffusion RL baselines like d1 and wd1.
D\(^2\)Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning: In each RL iteration, D\(^2\)Evo estimates difficulty using the current Solver, selects medium-difficulty real samples as anchors, and trains a Questioner to synthesize new problems of equivalent difficulty around these anchors. Consequently, using fewer than 2K real math problems, it outperforms the GRPO baseline (trained on 19K real samples) in both mathematics and general reasoning.
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning: DARTS redefines the long-tail bottleneck of LLM RL training rollouts from "scheduling circumvention" to "active distribution shaping." Through intra-prompt redundancy sampling + dual-end length sampling + variance-driven redundancy budget allocation, it explicitly shortens and tightens the rollout length distribution. Compared to VeRL, it achieves up to a 1.77× speedup on Qwen series 3B–32B models without sacrificing downstream accuracy.
Data- and Variance-dependent Regret Bounds for Online Tabular MDPs: For online episodic tabular MDPs with known transitions, this work designs a unified best-of-both-worlds algorithm based on optimistic follow-the-regularized-leader (OFTRL) with log-barrier. It provides first-order, second-order, and path-length data-dependent regret upper bounds in the adversarial regime, as well as variance-aware gap-independent and gap-dependent polylog bounds in the stochastic regime, complemented by matching lower bounds.
DR.Q: Debiased Model-based Representations for Sample-efficient Continuous Control: DR.Q builds upon the MR.Q framework ("model-based representation + actor-critic") by introducing two key components: explicitly maximizing the mutual information between \(z_{sa}\) and the next-state representation \(z_{s'}\) via InfoNCE, and mitigating early-experience overfitting with "faded prioritized replay" that fuses "PER × forget." It outperforms strong baselines like SimBaV2, MR.Q, and TDMPC2 across 73 continuous control tasks using a single set of hyperparameters.
Decoupling Skeleton and Flesh: Efficient Multimodal Table Reasoning with Disentangled Alignment and Structure-aware Guidance: This paper proposes a dual-component suite for multimodal table reasoning: during training, DiSCo decouples "skeleton" and "flesh" alignment targets via structure anonymization, allowing LVLMs to learn layouts with only 10K table images; during inference, Table-GLS compresses full-image QA into the minimal verifiable sub-table through a "global structure exploration \(\to\) self-refined sub-table extraction \(\to\) evidence-grounded reasoning" pipeline. This approach requires no specialized SFT for reasoning or external tools, outperforming SFT/RL baselines that rely on 82K-97K annotations across 21 benchmarks.
Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning: This paper proposes DCP (Direction-Conditioned Policies), which replaces the standard practice of the actor taking raw goal coordinates with a learned unit direction plus magnitude in representation space. By utilizing a scoring rule to select subgoals from historically visited states, the direction is stabilized during early training. DCP outperforms Contrastive RL (CRL) on most metrics across nine navigation and manipulation environments.
Distributional Inverse Reinforcement Learning: This paper proposes DistIRL: it models rewards as conditional distributions in offline Inverse Reinforcement Learning and upgrades the "expert is superior to the learner" constraint from expectation to First-order Stochastic Dominance (FSD). By relaxing the intractable 0/1 indicator function of FSD into an optimizable risk-weighted objective using Distortion Risk Measures (DRM), the framework systematically learns both complete reward distributions and distribution-aware policies from offline demonstrations for the first time.
Dr. Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research: Dr. Tulu proposes RLER (Reinforcement Learning with Evolving Rubrics), allowing evaluation rubrics to co-evolve with the policy during training. This extends RLVR from short-form QA to long-form deep research tasks with citations. Ultimately, DR Tulu-8B, trained from Qwen3-8B, outperforms Tongyi DR-30B by an average of 15.6 points across four long-form deep research benchmarks and reaches competitive performance with OpenAI Deep Research at a 1000x lower cost.
DRIVE: Distributional and Retrieval-Augmented Bidding with Value Evaluation: Addressing the issues of Decision Transformer (DT) methods in real-time bidding, specifically the "Average Action trap" (collapsing effective strategies into a mediocre action) and "erratic bidding in sparse long-tail traffic," DRIVE decouples candidate action generation from final decision-making. It employs a Gaussian Mixture Model (GMM) head to generate multimodal candidates, retrieves candidates from high-quality historical decisions, and uses an IQL value critic to score all candidates to select the optimal bid. DRIVE improves the average score on AuctionNet from 378.4 (strongest baseline) to 386.6 and can be integrated as a plug-and-play module into various DT-based methods.
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance: EAPO treats "consulting an external expert" as a learnable discrete action embedded in the policy space. This allows the LLM to call stronger models on-demand during the RL training phase to obtain intermediate hints. Through a gradually decaying acceptance rate, expert knowledge is internalized into the policy itself. During evaluation, the model performs independent reasoning and consistently outperforms pure self-exploratory RL on mathematical reasoning benchmarks such as AIME and AIMO.
EchoRL: Reinforcement Learning via Rollout Echoing: This paper identifies that in the late stages of RLVR training, GRPO-style methods suffer from "advantage degeneration"—where vanishing gradients occur because a group of rollouts all achieve success. The authors propose EchoRL: it identifies the "hardest yet successful" prefix, termed EchoClip, based on step-level entropy peaks from verified-success rollouts. This is added to the loss as an auxiliary SFT term, consistently delivering improvements of up to 5.6% ID and 5.0% OOD across 4 RLVR frameworks, 5 backbones, and 10 benchmarks.
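The "advantage degeneration" mentioned in the EchoRL entry can be illustrated with a few lines of arithmetic. The sketch below assumes the commonly used GRPO-style group normalization (reward minus group mean, divided by group standard deviation plus a small epsilon); the function name `group_advantages` and the value of `eps` are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumption: this is the usual GRPO-style normalization; `eps` is a
    hypothetical stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages ...
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# ... but when every rollout in the group succeeds, all advantages are zero,
# so the policy-gradient term for this prompt contributes no signal.
all_success = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(all_success)
```

When every rollout in a group receives the same reward, the numerator is zero for each sample, which is why an auxiliary term such as the SFT loss described above is one way to keep a learning signal on those prompts.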
FAB: A First-Order AB-based Gradient Algorithm for Distributed Bilevel Optimization over Time-Varying Directed Graphs: This paper proposes FAB—the first purely first-order algorithm for distributed bilevel optimization over time-varying directed graphs. By combining AB/Push-Pull communication with the value function penalty method, it achieves a non-asymptotic \(\mathcal{O}(K^{-2/3})\) convergence rate and simultaneously resolves the long-standing open problem regarding the convergence rate of AB/Push-Pull in non-convex scenarios over time-varying directed graphs.
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning: Addressing the limitations of multi-step denoising and BPTT in Diffusion Q-Learning—which are slow and unstable—BFQ employs a divide-and-conquer bootstrapping approach for the "noise to action" displacement. By learning short-range displacements (precisely estimable via Flow Matching marginal velocity) and progressively assembling them into a single-step direct mapping, it enables single-step action generation in both training and inference without auxiliary networks, distillation, or multi-stage pipelines, significantly improving both performance and speed on D4RL.
Flow-Equivariant World Models: Memory for Partially Observed Dynamic Environments: FloWM maintains structured dynamic memory in latent space by leveraging time-parameterized symmetries (flow equivariance). This solves the problem of objects "disappearing" after moving out of bounds in partially observed environments, achieving long-horizon prediction accuracy far exceeding diffusion and recurrent baselines (SSIM 0.9525 vs. DFoT 0.8885 in 3D Block World 210-step prediction).
From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning: This paper reformulates Offline Preference-Based Reinforcement Learning (PbRL) within the Forward-Backward (FB) representation space. It proves that under the FB framework, the standard Bradley-Terry preference loss is equivalent to the SimCLR contrastive loss. Consequently, it proposes FB-PbRL: first pretraining FB representations on reward-free offline data, then using a contrastive objective on preference data to search for the task vector \(\boldsymbol{z}^\star\) and fine-tune the representations. The entire pipeline avoids training any explicit reward or preference models.
Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory: This paper models active information seeking (e.g., 20 Questions, medical diagnosis, troubleshooting) as a two-player zero-sum Extensive-Form Game (EFG) and proposes Game of Thought (GoT). By using depth-limited subgame construction and applying Counterfactual Regret Minimization (CFR) to solve for Nash Equilibrium (NE), GoT generates "randomized questioning strategies." It significantly reduces worst-case interaction rounds across all datasets, with a 15–40% performance gain over UoT in weighted variants.
Global Policy-Space Response Oracles for Two-Player Zero-Sum Games: This paper points out that prevailing PSRO methods focus only on local information from the "restricted game" when expanding the policy population, leading to a worst-case requirement of nearly \(N\) pure policies for convergence. It proposes Global PSRO, a two-stage exploration-selection framework that first samples multiple candidate best responses and then selects the optimal expansion by directly scoring the post-expansion Population Exploitability (PE). The costs of multi-candidate training and evaluation are suppressed to acceptable levels through a parameter-shared conditional policy network.
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning: This paper first empirically demonstrates through a newly created State Value Estimation Benchmark (SVEB) that PPO critics in LLM RL almost completely degenerate into the group relative reward baseline of GRPO. It then proposes two state value estimation methods aimed at "no extra rollouts and nearly zero additional compute": Numca uses numerical milestones to rewrite mathematical reasoning as goal-conditioned RL for credit assignment, while Hista uses the last-layer hidden states of the LLM plus MinDistance for probability-weighted reward averaging. These methods reduce MAE below GRPO/PPO across five SVEB subsets and consistently improve strong algorithms like DAPO/CSIPO on multiple mathematical benchmarks.
How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs: A generated trajectory is viewed as an attention-induced Directed Acyclic Graph (DAG). A Doob-h-like reweighting is applied to filter information paths that "actually flow toward the answer," and the "flow throughput" of each token is used for non-uniform credit assignment in GRPO. This concentrates training signals on a few critical tokens supporting the answer, consistently outperforming GRPO and various point-wise heuristics in mathematical reasoning tasks.
How Reasoning Evolves from Post-Training Data: An Empirical Study Using Chess: The authors utilize "training LLMs to play chess" as a clean experimental testbed for verifiable RL. By systematically comparing the impact of six self-constructed SFT datasets on RL, they find that while "direct prediction of the Best Move" achieves the highest scores, it leads to unfaithful reasoning after RL. Conversely, "predicting the Best Line (multi-step moves)" yields comparable performance but results in more stable RL and more faithful reasoning. Furthermore, they distill three metrics from SFT checkpoints to predict ultimate RL performance. Finally, their 7B model outperforms gpt-oss-120b on multiple chess benchmarks.
Informed Asymmetric Actor-Critic: Leveraging Privileged Signals Beyond Full-State Access: This paper relaxes the "asymmetric actor-critic" requirement from "the critic must observe the full environment state" to "the critic can observe any state-dependent privileged signals." It proves that any such signals yield unbiased policy gradients and proposes two informativeness tests to identify the most useful signals. Experiments demonstrate that carefully selected partial privileged signals can match or even outperform full-state asymmetric baselines while using less state information.
InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning: This paper upgrades the "iterative reasoning + explicit summary" paradigm from pure SFT to end-to-end RL, proposing InftyThink+. By using trajectory-level GRPO to simultaneously optimize three decisions—"when to summarize, what to retain, and how to continue"—and incorporating an efficiency reward, it achieves a 21% increase in AIME24 accuracy and a 32.8% reduction in latency on DeepSeek-R1-Distill-Qwen-1.5B.
Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning: From an information-theoretic perspective, this paper characterizes the "mutual influence" between agents using conditional mutual information. It designs attackers that simultaneously mask observations and perturb actions to minimize cross-group mutual information. Consequently, the IBAL policy is trained to maintain stable decision-making even during collaborative collapses. It significantly outperforms existing robust MARL methods under various attacks and "missing teammate" perturbations in SMAC / SMACv2 / LBF.
LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation: This paper proposes LABO, which integrates LLMs as "low-fidelity" evaluation sources into the Bayesian Optimization loop. It decomposes the ground-truth experiment \(f_R\) using a Kennedy–O'Hagan joint Gaussian Process into a scaled LLM prediction \(\rho f_L\) plus a residual process \(\delta\). A "Difference Dominance Ratio" \(p_\Delta = \sigma_\delta^2/(\rho^2\sigma_L^2 + \sigma_\delta^2)\) is used as a gating mechanism to decide whether each candidate warrants an expensive real-world experiment. This allows broad exploration via nearly free LLM queries while concentrating expensive experiments in regions where the LLM is untrustworthy. LABO significantly outperforms vanilla BO, LLAMBO, BOPRO, and CAKE across 6 scientific optimization tasks (e.g., COF, Fullerene) under the same real-world budget.
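The "Difference Dominance Ratio" in the LABO entry is a simple variance ratio, and a small sketch may help make the gating idea concrete. The formula below follows the entry; the threshold value and the direction of the gate (run the real experiment when the residual variance dominates) are assumptions made for illustration, and both function names are hypothetical.

```python
def difference_dominance_ratio(rho, var_llm, var_delta):
    """p_Delta = sigma_delta^2 / (rho^2 * sigma_L^2 + sigma_delta^2)."""
    denom = rho ** 2 * var_llm + var_delta
    if denom <= 0:
        raise ValueError("total variance must be positive")
    return var_delta / denom


def warrants_real_experiment(rho, var_llm, var_delta, threshold=0.5):
    """Hypothetical gate: run the expensive experiment when the residual
    process explains most of the predictive variance, i.e. when the scaled
    LLM prediction alone is not trusted. The threshold is an assumption."""
    return difference_dominance_ratio(rho, var_llm, var_delta) >= threshold


# Residual variance dominates: the LLM prediction is not trusted here.
print(warrants_real_experiment(rho=1.0, var_llm=0.1, var_delta=0.9))

# Residual variance is small: the cheap LLM query is assumed to suffice.
print(warrants_real_experiment(rho=1.0, var_llm=0.9, var_delta=0.1))
```

The ratio lies in [0, 1]: values near 1 mean the residual \(\delta\) carries most of the uncertainty, and values near 0 mean the scaled LLM term \(\rho f_L\) does.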
Laplacian Representations for Decision-Time Planning: This paper introduces ALPS, which utilizes the eigenvector space of the graph Laplacian (scaled to approximate commute-time distance) as a latent space for hierarchical decision-time planning. It first discovers subgoals using k-means in this space and generates high-level paths via Dijkstra, then performs short-range low-level planning in the original state space using CEM with behavior priors. On OGBench offline goal-conditioned RL tasks, this marks the first time model-based planning methods systematically outperform model-free SOTA.
LASER: Learning Active Sensing for Continuum Field Reconstruction: This work models the problem of "where to place sparse sensors" as a POMDP. It employs a "continuum field latent world model" (comprising an encoder, GRU, diffusion dynamics predictor, and implicit neural field decoder) to provide imagined next-step latent states as policy conditions. The cross-attention policy is trained using GRPO with dynamic group filtering and multi-step lookahead rewards. LASER consistently outperforms fixed layouts and offline-optimized layouts on sparse sensing reconstruction tasks across Navier-Stokes, Shallow-Water Equations, and real Sea Surface Temperature (SST) datasets.
Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning: By explicitly parameterizing the goal-conditioned value function as the negative Euclidean distance in an asymmetric latent space \(V(s,g)=-\|\varphi_S(s)-\varphi_G(g)\|_2\), combined with continuity regularization and a HIQL-style hierarchical structure, LAVL achieves SOTA on 20 out of 22 OGBench datasets. It increases the success rate on long-range tasks like giant maze and stitch datasets from nearly zero to over 80%.
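The value parameterization in the LAVL entry, \(V(s,g)=-\|\varphi_S(s)-\varphi_G(g)\|_2\), can be written out directly. In the sketch below the two encoders are hand-written toy maps standing in for learned networks (an assumption for illustration only); the point is that separate state and goal encoders make the value asymmetric in its arguments.

```python
import math


def phi_s(x):
    """Toy state encoder (hypothetical stand-in for a learned network)."""
    return (x[0], x[1])


def phi_g(x):
    """Toy goal encoder, deliberately different from phi_s."""
    return (2.0 * x[0], x[1] + 1.0)


def value(s, g):
    """V(s, g) = -||phi_S(s) - phi_G(g)||_2."""
    return -math.dist(phi_s(s), phi_g(g))


a = (1.0, 0.0)
b = (0.0, 1.0)

# Values are never positive, and swapping the roles of state and goal
# generally changes the result because the two encoders differ.
print(value(a, b))
print(value(b, a))
```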
Learning in Structured Stackelberg Games: This paper introduces a structural assumption to "contextual Stackelberg games" (where the mapping context \(\to\) follower type originates from a hypothesis class \(\mathcal{H}\)) and constructs two new types of learning-theoretic dimensions: the Stackelberg-Littlestone dimension (SLdim), which characterizes online regret bounds, and the \(\gamma\)-SG / \(\gamma\)-SN dimensions, which characterize lower and upper bounds for PAC sample complexity. The authors prove that these dimensions yield strictly better bounds than those based on various Littlestone / Natarajan dimensions and provide instance-optimal online algorithms (SSOA) and batch algorithms (\(\mathfrak L^*\)).
Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory: BudgetMem reorganizes "runtime agent memory extraction" into a modular pipeline consisting of "filtering → parallel entity/temporal/topic extraction → summarization." Each module is equipped with LOW/MID/HIGH budget tier interfaces. A shared lightweight router is trained via PPO to select tiers for each module upon the arrival of a query, simultaneously improving F1/Judge scores and reducing the average cost per query on LoCoMo, LongMemEval, and HotpotQA.
Learning to Approximate Uniform Facility Location via Graph Neural Networks: This paper designs an MPNN that neuralizes the classic approximation algorithm SimpleUniformFL for Uniform Facility Location. The model can be trained end-to-end using an unsupervised expected cost loss and possesses provable approximation bounds of \(\mathcal{O}(\log n)\) (reducible to \(\mathcal{O}(1)\) with the recursive version). Empirically, it outperforms the classic SimpleUniformFL algorithm and approaches ILP optimality.
Learning to Bet for Horizon-Aware Anytime-Valid Testing: This paper reformulates the design of anytime-valid sequential tests under a strict observation limit \(N\) as a finite-horizon optimal control problem with state space \((t,\log W_t)\). It theoretically proves a three-zone "phase portrait"—optimal Kelly betting in the "on-schedule" middle band, aggressive betting when falling behind, and conservative betting when ahead. A unified DQN agent, trained on various synthetic Beta distributions, automatically learns state-dependent strategies consistent with this phase portrait, achieving higher rejection rates within the deadline and narrower confidence sequences on both synthetic and real data while maintaining anytime-validity via Ville’s inequality.
Learning to Route Languages for Multilingual Policy Optimization: This paper proposes LRPO (Language-Routed Policy Optimization), which treats "which language to use for rollout generation" as a learnable variable. Using a contextual bandit-form language router, it selects the most informative language combinations for each training sample under a fixed rollout budget. By pulling multilingual rollouts into the same scale via offline estimation and online calibration of cross-lingual similarity rewards, it performs GRPO and consistently outperforms GRPO and various dominant-language baselines across Qwen/Llama/Gemma backbones on five multilingual benchmarks.
Learning to Search and Searching to Learn for Generalization in Planning: This paper proposes GSP: a "self-improving generalized planner" that integrates Weighted A* best-first search and Q-learning within a unified loop, using Relational Graph Neural Networks (R-GNN) to represent \(Q_\theta(s,a)\). By training only on small-scale instances, it achieves zero-shot generalization to instances over ten times larger (e.g., from \(\le\) 30 blocks to 488 blocks in Blocksworld). It sets new coverage records across multiple IPC benchmarks, Sokoban, PushWorld, and The Witness, significantly outperforming DRL baselines based on real-time search.
Learning Unmasking Policies for Diffusion Language Models: This paper explicitly models the decoding process of masked diffusion language models (dLLMs) as an MDP. Using GRPO, it trains a single-layer Transformer policy—comprising less than 0.01% of the base model's parameters and taking only token confidence as input—to adaptively decide which positions to unmask at each step. In the semi-AR setting, it matches manual heuristics like Fast-dLLM; in the full-diffusion setting, it significantly outperforms them and demonstrates transferability across models, tasks, and sequence lengths.
LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning: This paper proposes LMAC—using LLMs offline to design executable communication protocol code for cooperative MARL. Based on the "state reconstructability" metric, it performs two rounds of feedback iteration (first improving reconstruction accuracy, then reducing cross-agent imbalance). It significantly outperforms communication baselines such as TarMAC/SMS/T2MAC/MASIA on benchmarks like SMAC-Comm, LBF, GRF, and SMACv2, even exceeding the QMIX+State upper bound (where the global state is provided to all agents) in some scenarios.
Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism: This paper challenges the prevailing consensus that "offline RL must be explicitly conservative" and proposes Neubay: utilizing a Bayesian perspective for posterior model ensembles, employing long-horizon rollouts (hundreds of steps) to naturally absorb value overestimation, and controlling compounding errors with layer normalization and uncertainty thresholds. It matches SOTA conservative algorithms across 33 datasets in D4RL/NeoRL without pessimistic penalties and sets new records on 7 datasets.
Making Expert Reasoning Learnable with Self-Distillation: DAIL utilizes a hybrid strategy rollout where "Teacher = itself with the expert solution + Student = itself without the expert solution" to rewrite fewer than 1,000 expert trajectories into reasoning chains aligned with the student's policy distribution. It then employs a contrastive loss to suppress high-probability shortcut tokens from a "negative reference model that only sees intermediate answers," achieving up to a 31% improvement in pass@128 on Qwen2.5-Instruct / Qwen3 while reducing the required reasoning tokens by half.
MFPO: Accelerating MaxEnt RL to Gaussian Policy Speeds with Few-step MeanFlow Policy: MFPO employs MeanFlow models (learning average velocity instead of instantaneous velocity) as an RL policy to reduce diffusion policy sampling from 20+ steps to 2. It uses an average divergence network to compute action likelihoods and ESS-weighted SNIS to combine Gaussian and policy proposals for soft policy improvement, matching or exceeding diffusion baselines on MuJoCo/DMC/HumanoidBench while reducing training time by \(\sim 50\%\).
Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization: Metis reframes multi-turn jailbreaking as a test-time policy optimization problem under an adversarial POMDP framework. An Attacker and a Metacognitive Evaluator form a closed loop in which dense analytical feedback from the Evaluator serves as a "semantic gradient" to guide the Attacker's belief updates and policy improvements. Without retraining any weights, it achieves an average ASR of 89.2% on 10 frontier models (including O1 / GPT-5-chat / Claude-3.7), while reducing token consumption by an average of 8.2x compared to strong baselines.
Mind Dreamer: Untethering Imagination via Active Causal Intervention on Latent Manifolds: This paper proposes Mind Dreamer for Model-Based Reinforcement Learning (MBRL), which utilizes an adversarial generator to "jump" to key anchors on the learned latent manifold of the world model that are not covered by historical trajectories. It resolves credit assignment across breakpoints through newly designed Relay Value/Uncertainty functions (incorporating a \(\gamma^2\) discount), achieving an average \(1.67\times\) speedup over DreamerV3 on DMC and up to \(8.8\times\) on sparse reward tasks.
MindZero: Learning Online Mental Reasoning with Zero Annotations: MindZero reformulates Bayesian Inverse Planning (BIP) into a "self-supervised RL" objective for multimodal LLMs. The reward maximizes the likelihood of observed human actions given the generated mental hypotheses. Trained via GRPO, the model achieves single-forward, fast, and robust online mental reasoning without requiring any manual mental annotations.
MoMa QL: Accelerating Diffusion/Flow Matching Policies for Offline and Offline-to-Online RL via Moment Matching: MoMa QL replaces the standard BC loss with Maximum Mean Discrepancy (MMD), compressing the multi-step sampling of diffusion/flow matching policies into a single-step or few-step "marginal-preserving interpolation" sampler. It achieves a Gym normalized score of 95.5 on D4RL, significantly leading Diffusion-QL (87.9). Due to much faster sampling, it shows greater gains in offline-to-online fine-tuning compared to Consistency AC and Diffusion-QL.
Multi-Agent Decision-Focused Learning via Value-Aware Sequential Communication: SeqComm-DFL treats "multi-agent communication" as a predictor and "joint policy selection" as a downstream optimizer. By combining value-aware message generation, Stackelberg sequential conditions, and implicit differential bi-level optimization, it aligns communication learning directly with team rewards. It achieves 4-6x cumulative reward gains in hospital scheduling and a >13 percentage point win rate increase in SMAC.
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch: DoorDash models the "objective weight" regulation of food delivery dispatching as an offline multi-agent reinforcement learning problem. Instead of replacing the existing combinatorial dispatch optimizer, each store-level agent selects a discrete multiplier based on local market states to fine-tune the optimizer's trade-off between "delivery speed vs. bundling efficiency." Using Double DQN with Conservative Q-regularization (CQL), the policy is trained offline from noisy, delayed, and coupled market logs. In a production switchback experiment involving approximately 4,000 geographic regions, the system achieved "increased bundling rates and reduced courier active time without harming customer delivery quality."
Noise-Guided Transport: Imitation Learning from Random Priors: This paper reformulates imitation learning as an adversarial training process in which a predictor network fits a frozen random prior network on expert data while moving away from it on agent data. The authors prove that this objective is equivalent to minimizing the Earth Mover's Distance (EMD) between expert and agent distributions. The resulting lightweight method eliminates the need for gradient penalties and successfully learns humanoid robot gaits in an ultra-low-data regime of only 20 transitions.
Offline Reinforcement Learning with Generative Trajectory Policies: This paper unifies Diffusion Policies, Flow Matching, and Consistency Policies into a single family called "Generative Trajectory Policies (GTP)" using "continuous-time ODE solution maps." Combined with a closed-form score approximation to align with offline samples and an advantage-weighted training objective, the policy achieves near-perfect scores on hard tasks like AntMaze while maintaining low-latency sampling.
Offline Reinforcement Learning with Universal Horizon Models: The authors lift the restriction that the "Geometric Horizon Model (GHM) can only sample from a fixed discounted distribution" by proposing a Universal Horizon Model (UHM) capable of directly sampling future states over an arbitrary horizon \(n\). By truncating excessively long horizons using a Winsorized geometric distribution, the proposed method achieves an average success rate improvement of approximately 14% over the strongest baselines across 100 OGBench tasks.
One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models: This paper systematically measures five types of biases—length, uncertainty, position, sycophancy, and model style—across five high-quality RMs (including SOTA Skywork-Reward-V2). It categorizes them into "low complexity (linearly repairable)" and "high complexity (linearly non-repairable)" and proposes mechanistic reward shaping. By using DiffMean linear probes to perform null-space projection on the final-layer hidden states, the method significantly mitigates the first three types of biases and generalizes OOD to best-of-N without compromising RewardBench2 accuracy.
ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research: The authors formalize "repairing an infeasible operations research model" as a solver-in-the-loop MDP where each modification step requires re-running Gurobi to obtain Irreducible Infeasible Subsystem (IIS) feedback. They release the ORLoopBench benchmark (5362 LP/MILP repair instances + inventory decision bias evaluation) and utilize RLVR to train an 8B model that outperforms closed-source APIs (92.4%) with a 95.3% RR@5 on LP repair tasks.
PAC-Bayesian Reinforcement Learning Trains Generalizable Policies: This paper provides the first PAC-Bayesian RL generalization bound that explicitly depends on the mixing time of the Markov chain and scales only linearly with the long horizon \(1/(1-\gamma)\). By embedding this bound as an "alive" training objective within SAC, the authors derive the PB-SAC algorithm—delivering non-vacuous deployment certificates and competitive performance on MuJoCo continuous control tasks simultaneously.
Parameter-free Dynamic Regret: Time-varying Movement Costs, Delayed Feedback, and Memory: This paper presents the first parameter-free algorithm for unconstrained online convex optimization (OCO) with time-varying movement costs and dynamic comparator sequences. By reducing delayed feedback and time-varying memory to OCO with time-varying movement costs, the authors improve the dynamic regret upper bounds for all three settings within a single framework.
PAWS: Preference Learning with Advantage-Weighted Segments: PAWS identifies that the common practice in Preference-based RL (PbRL) of "training utility functions at the segment level but using them at the step level" causes distribution shifts. It proposes training advantage functions and updating policies consistently at the segment level. By using segment-level advantage weighting with trust-region constrained weighted maximum likelihood, PAWS significantly improves preference signal utilization and success rates on Meta-World robotic manipulation tasks.
Perceptual Flow Network for Visually Grounded Reasoning: Abandoning the traditional RLVR approach of "hard supervision using precise boxes from vision experts," PFlowNet models the act of perception as a structured latent variable called Perceptual Flow. It approximates the ideal reasoning-oriented posterior using a variational distribution \(p_\theta(Z|X)\) and trains it via Sub-TB variational RL, multi-dimensional rewards, and Vicinal Geometric Shaping. This allows the 8B Qwen3-VL to achieve new SOTA scores of 90.6% on V* Bench and 67.0% on MME-RealWorld-lite.
Plug-and-Play Benchmarking of Reinforcement Learning Algorithms for Large-Scale Flow Control: This paper introduces FluidGym—the first RL benchmark for active flow control implemented entirely in PyTorch without external CFD solver dependencies. It is end-to-end differentiable, natively supports multi-agent and 3D flow fields, and provides standardized results from 25k+ GPU hours across 13 2D/3D environments using PPO/SAC/TD-MPC/DPC.
Position: Deployed Reinforcement Learning should be Continual: This is a position paper: the authors argue that any RL system that still obtains evaluative reward signals after deployment and whose environmental complexity exceeds the agent's representation/computational capacity is essentially a Continual Reinforcement Learning (CRL) problem. It advocates for abandoning the "train-then-fix" paradigm in favor of allowing agents to continuously update policies during deployment.
Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates: In linear contextual bandits, the authors explicitly decouple "when rewards are received" and "whether context within an interval can be utilized," two axes previously confused by the term "batched." They define a more practical "rare parameter updates" setting (restricting only reward-driven updates while allowing reward-free context adaptivity). Based on this, they propose BLCE-G and BLCE, which require only \(\mathcal{O}(\log\log T)\) parameter updates. BLCE-G is the first to achieve minimax-optimal regret \(\widetilde{\mathcal{O}}(\sqrt{dT\log K}\wedge d\sqrt T)\) across both small-\(K\) and large-\(K\) regimes, while BLCE removes the G-optimal design bottleneck to achieve the lowest runtime among all optimal algorithms. This approach is further extended to generalized linear bandits with BGLE, which eliminates dependence on the worst-case curvature parameter \(\kappa\).
Probing RLVR Training Instability through the Lens of Objective-Level Hacking: The authors propose the "objective-level hacking" framework, attributing the phenomenon of growing training-inference discrepancy in MoE models during RLVR to biased pseudo-signals introduced into the optimization objective by token-level weight distortions. Experiments on a 30B MoE model verify that "bias (not variance) is the culprit."
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation: Addressing the issue where naive policy gradients collapse into "equal-length repetitive paths" in proactive recommendation tasks, the authors theoretically attribute the failure to the "length shortcut" and high variance induced by positive mean stepwise rewards after path-level reward decomposition. They propose ProRL: using Stepwise Reward Centering to subtract a constant baseline from the expected reward at each step to eliminate length bias, and Position-Specific Advantage Estimation to reduce variance via GRPO-style group baselines based on step positions. Experimental results on three real-world datasets show that ProRL outperforms heuristic, supervised, and LLM-based SOTA methods across four metrics: IoI, IoR, CTR, and Coherence.
Provable Benefit of Curriculum in Transformer Tree-Reasoning Post-Training: This paper provides the first rigorous sample complexity proof for "easy-to-hard" curriculum RL post-training: on the state-conditioned autoregressive reasoning trees of transformers, if the curriculum maintains the difficulty ratio of adjacent stages at the level of the \(L/p\)-th root of the target difficulty, the total sample complexity can be reduced from exponential \((C^\star)^L\) in direct training to polynomial \(L\cdot (C^\star)^{p_\max}\) in the curriculum version.
Quantifying and Optimizing Simplicity via Polynomial Representations: The authors propose using "Chebyshev polynomials fitted along data interpolation paths" as a low-dimensional function-space proxy for neural networks. They define "Effective Degree" (ED)—the sum of absolute coefficients weighted by their polynomial orders—as a scalar measure of "how simple a function is." ED predicts the generalization gap on CIFAR-10, ImageNet, and CLIP more accurately than existing proxies like sharpness or parameter \(L_2\) norms. Furthermore, the estimation pipeline is differentiable, allowing ED to serve as a "simplicity regularizer" during training, which consistently yields gains across image, text, CLIP fine-tuning, and reinforcement learning tasks.
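As a rough illustration of the Effective Degree entry above, the following sketch fits Chebyshev coefficients to function values sampled along a one-dimensional path and sums the absolute coefficients weighted by their orders. The function name `effective_degree`, the `max_degree` cutoff, and the absence of any normalization are assumptions made for illustration; the paper's exact definition and estimation pipeline may differ.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def effective_degree(ts, values, max_degree=8):
    """Order-weighted sum of absolute Chebyshev coefficients: sum_k k * |c_k|.

    ts: sample locations in [-1, 1] along a (hypothetical) interpolation path.
    values: function outputs at those locations.
    """
    coeffs = C.chebfit(ts, values, max_degree)  # least-squares Chebyshev fit
    orders = np.arange(len(coeffs))             # 0, 1, ..., max_degree
    return float(np.sum(orders * np.abs(coeffs)))


ts = np.linspace(-1.0, 1.0, 201)
ed_linear = effective_degree(ts, ts)                    # f(t) = t = T_1(t)
ed_cubic = effective_degree(ts, 4 * ts**3 - 3 * ts)     # f(t) = T_3(t)
```

Under this toy definition a function that is exactly the first Chebyshev polynomial scores about 1 and the third scores about 3, so "simpler" functions receive lower values.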
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation: This work reformulates Tikhonov-regularized Natural Policy Gradient (NPG) as a "standard policy gradient with transformed advantages" via the Woodbury identity. By utilizing Randomized Block Kaczmarz iterations on mini-batches to solve this transformation, the method bypasses explicit Fisher matrix construction, Conjugate Gradient inner loops, and architecture-dependent curvature approximations like KFAC. It computes natural policy gradients using a single standard backpropagation pass, matching or exceeding the performance of TRPO/ACKTR/KFAC on MuJoCo and Procgen benchmarks.
Reinforced Sequential Monte Carlo for Amortised Sampling: This work unifies hierarchical variational inference (HVI), MaxEnt RL, and Sequential Monte Carlo (SMC)/Annealed Importance Sampling (AIS) into a single framework. The learned policy and flow function serve simultaneously as the proposal kernel and twisting target for SMC. Conversely, near-target samples produced by SMC are used as an off-policy behavior policy to train the neural sampler. Coupled with adaptive weight tempering and importance-weighted experience replay, this approach improves both mode coverage and training stability on multi-modal targets and the alanine dipeptide Boltzmann distribution.
Reinforcement Learning for Reachability: Guaranteeing Asymptotic Optimality: This paper addresses the problem of learning reachability specifications on unknown MDPs by proposing a direct learning algorithm that refines PAC parameters in stages. It proves the existence of a finite stage \(K_{\mathsf{opt}}\) with probability 1, after which only the optimal policy is output. This stage is explicitly characterized using internal MDP parameters, and empirical results on quantitative verification benchmarks confirm that the optimal policy emerges in very few stages (median \(k=2\)).
Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies: Addressing the core challenge that "no direct target policy samples exist in online RL," this paper proposes Reverse Flow Matching (RFM). By recasting the training of diffusion/flow policies to fit Boltzmann distributions as a "posterior mean estimation given intermediate noise" problem, it uses Langevin Stein operators to construct zero-mean control variates. This unifies existing "noise expectation" and "gradient expectation" methods into a single family of estimators. Consequently, it enables flow policies (not just diffusion policies) to sample Boltzmann distributions for the first time, achieving more stable and superior performance on continuous control benchmarks compared to diffusion baselines.
Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games: KLENT recombines three mature components—reverse-KL regularization (to control policy update scale), entropy regularization (to maintain exploration), and λ-return (to balance bias and variance)—into model-free self-play RL. It achieves 4x the training efficiency of Gumbel AlphaZero across five board games and provides convergence proofs for both normal-form and finite-length scenarios.
RL-SPH: Learning to Achieve Feasible Solutions for Integer Linear Programs: This paper proposes RL-SPH — an end-to-end Reinforcement Learning (RL) heuristic that does not rely on external ILP solvers and independently produces 100% feasible solutions. By utilizing "feasibility rewards + two-phase strategy + feasibility-aware neighborhood search," the Graph Transformer agent reduces the average primal gap by 28.6x on ILPs containing non-binary integer variables.
RL4RLA: Teaching ML to Discover Randomized Linear Algebra Algorithms Through Curriculum Design and Graph-Based Search: RL4RLA utilizes a "numerical curriculum of increasing difficulty + Monte Carlo Graph Search (MCGS)" to drive an RL agent to compose interpretable Randomized Numerical Linear Algebra (RLA) algorithms from linear algebra primitives, successfully reproducing classic methods such as Sketch-and-Precondition, Randomized Kaczmarz, and Newton Sketch.
RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments: RLVE transforms language model RL training from "static problem sets" into 400 programmable verifiable environments where problems are algorithmically generated and rewards are verified via code. By adaptively increasing problem difficulty as the policy model improves, the training signal is kept at the frontier of model capability; on a strong 1.5B model already saturated by standard RLVR, RLVE achieves a \(3.37\%\) average gain across six reasoning benchmarks using only 1/3 of the compute (compared to a \(0.49\%\) gain from continued standard RL).
RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning: This paper integrates seven categories of industrial design rules for 3D chip floorplanning into a unified actor-critic RL framework. The core mechanism compiles each rule into a \(W\times H\) "adjacency matrix mask" to proactively block illegal positions using large negative values before policy softmax. Combined with a hybrid action space (discrete position + continuous aspect ratio) and Transformer-encoded netlist features, it is the first single agent capable of simultaneously satisfying seven rules—including boundary, grouping, multi-layer alignment, and non-overlap—while demonstrating zero-shot transferability to unseen circuits.
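The masking mechanism described in the RulePlanner entry above can be sketched as follows. This is a minimal illustration, assuming each rule yields a boolean \(W\times H\) mask of legal positions and that masks are combined by logical AND; the function name `masked_softmax`, the fill value `-1e9`, and the toy grid are hypothetical and not taken from the paper.

```python
import numpy as np


def masked_softmax(logits, legal_mask, neg=-1e9):
    """Softmax over a W x H grid with illegal positions suppressed.

    logits: (W, H) raw position scores from the policy.
    legal_mask: (W, H) boolean array, True where placement is allowed.
    """
    masked = np.where(legal_mask, logits, neg)  # large negative blocks illegal cells
    shifted = masked - masked.max()             # subtract max for numerical stability
    probs = np.exp(shifted)
    return probs / probs.sum()


# Two hypothetical rule masks on a 2 x 3 grid, combined with logical AND.
boundary_ok = np.array([[True, True, False], [True, True, True]])
overlap_ok = np.array([[True, False, True], [True, True, True]])
legal = boundary_ok & overlap_ok

logits = np.zeros((2, 3))
probs = masked_softmax(logits, legal)
```

With uniform logits, the probability mass is spread evenly over the four cells that every rule allows, and blocked cells receive zero probability. If no cell is legal, this sketch falls back to a uniform distribution, which a real system would need to handle explicitly.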
Safe In-Context Reinforcement Learning: This paper introduces safety constraints to in-context reinforcement learning (ICRL) for the first time, proposing SCARED. During pre-training, it utilizes an exact-penalty Lagrangian with a single multiplier and a hinge function to enable a Transformer policy to adapt to CMDPs at test-time without any parameter updates. By conditioning on cost-to-go context, the policy achieves monotonically increasing rewards and decreasing costs on OOD Grid / MuJoCo / Velocity benchmarks, allowing smooth switching between conservative and aggressive behaviors based on a user-provided budget \(\delta\).
Safe Reinforcement Learning with Preference-Based Constraint Inference: This paper proposes PbCRL, which utilizes an extended Bradley-Terry preference model with a "dead-zone" to learn safety constraints from trajectory comparisons. By incorporating a signal-to-noise ratio (SNR) regularization to prevent the cost function from flattening and employing a two-stage training pipeline (offline pre-training + online sparse-label fine-tuning), the method significantly reduces costs while maintaining rewards across Safety Gymnasium, autonomous driving, and language model alignment tasks.
Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed: Based on the UVA-Padova physical model, the authors developed a unified T1D/T2D diabetes simulator. They discovered that while 8 mainstream Safe RL algorithms satisfy safety constraints on training patients, their Time-in-Range (TIR) drops by 8–13% when deployed to unseen patients. They propose using Basis-Adaptive Neural ODEs to predict blood glucose trajectories and apply predictive shielding to filter dangerous actions at test time, restoring 13–14% TIR for baselines like PPO-Lag and CPO on OOD patients.
Shapley Neuron Values for Continual Learning: Which Neurons Matter Most?: The authors adapt Shapley values from cooperative game theory to the "filter" level of Convolutional Neural Networks, using a triple approximation of Monte Carlo, truncation, and Multi-Armed Bandits to estimate continuous importance rankings for each neuron. By freezing the Top-\(r\%\) "expert" neurons and leaving the rest plastic for further training, they achieve a \(+2.88\%\) accuracy gain in Class-Incremental Learning and a \(+6.46\%\) gain in Task-Incremental Learning on ImageNet-1k compared to the second-best buffer-free method, without storing samples or expanding the architecture.
Single-Rollout Hidden-State Dynamics for Training-Free RLVR Data Selection: SHIFT utilizes the "start token → end token" hidden-state difference \(\Delta(x)=\mathbf{e}(x)-\mathbf{s}(x)\) from a single greedy decoding rollout as both a utility proxy and a diversity feature for RLVR samples. It then employs a quality-weighted farthest-first CoreSet to select a minimal set of samples from a large unlabeled pool without training, rewards, or ground truth answers.
Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning: For non-stationary reinforcement learning scenarios where the environment silently "drifts" without task IDs or context prompts, this paper proposes Space-sampled Value Decay (SsVD). By sampling from the state space and continuously decaying the Q-values of "unvisited or stale" states toward a baseline, the agent actively forgets outdated knowledge, thereby maintaining high returns in dynamically changing environments.
SPHERE: Mitigating the Loss of Spectral Plasticity in Mixture-of-Experts for Deep Reinforcement Learning: This paper formalizes the loss of plasticity in Mixture-of-Experts (MoE) policies during continual reinforcement learning (CRL) as the decline of the spectral entropy effective rank of the empirical NTK matrix. It employs Gauss-Newton and Kronecker factorization to reduce this to a computable proxy based on the "expert feature Gram matrix." Finally, a one-line Parseval penalty (SPHERE) is used to increase this proxy, improving task success rates by 133% and 50% in MetaWorld and HumanoidBench CRL settings, respectively.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning: This paper proposes the Reach-Avoid Probability Certificate (RAPC), which utilizes a max-min-clamped Bellman contraction operator to lower-bound the reach-avoid probability. Combined with a "compensation factor" to normalize against adversarial \(\gamma^T\) decay and symmetric gradient projection to jointly optimize conflicting "cost" and "reach-avoid probability" objectives, the method achieves lower cumulative costs and higher reach success rates than RC-PPO / RESPO / CPPO on MuJoCo tasks.
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes: This paper constructs MBPP-Honeypot, an RLVR environment that naturally induces reward hacking (hardcoding test cases). It systematically characterizes four types of strategies resulting from using "white-box deception probes as training signals": honest, blatant deception, obfuscated policy, and obfuscated activations. It demonstrates that stable convergence to an honest strategy in reward hacking scenarios is achievable provided both the KL regularization coefficient \(\beta\) and the probe penalty coefficient \(\alpha\) are sufficiently large.
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models: This paper treats LLM chain-of-thought as a "point cloud" in embedding space. It uses Topological Data Analysis (TDA) to extract persistent homology features as an objective measure of reasoning quality. Experiments on the AIME dataset demonstrate that TDA features significantly outperform traditional graph statistics in predicting Smith-Waterman alignment scores (average \(R^2=0.236\) vs. average \(R^2=0.064\)).
The Surprising Difficulty of Search in Model-Based Reinforcement Learning: The authors counter-intuitively demonstrate that search failure in model-based RL is not caused by model inaccuracy, but rather by overestimation bias stemming from the policy mismatch between the MPC behavior policy and the value function training policy. They propose the MRS.Q algorithm, which utilizes a "min" operation over an ensemble of 10 value functions, consistently outperforming SOTA methods like TD-MPC2, BMPC, BOOM, and SimbaV2 across over 50 continuous control tasks.
Extra-CoT: A Chain-of-Thought Compression Framework under Extreme Compression Ratios: Extra-CoT proposes a three-stage framework (semantic-preserving compressor → mixed-ratio SFT → hierarchical reward RL) that maintains reasoning accuracy even under extreme compression ratios (retaining only 20% of tokens). It achieves a 73% token reduction on MATH-500 while improving accuracy by 0.6%.
Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning: AES casts the exploration intensity scheduling problem of maximum entropy RL in the dynamic regret framework of Online Convex Optimization (OCO), deriving the theoretical result that "entropy weight should be proportional to the square root of the environment drift magnitude." By using TD-error quantiles as an observable drift proxy, it yields a fully online, algorithm-agnostic entropy scheduler that halves catastrophic recovery times across SAC / PPO / SQL / MEow frameworks in 12 tasks.
Trajectory-Level Data Augmentation for Offline Reinforcement Learning: This paper proposes LIFT: in active alignment tasks, it leverages the geometric properties of trajectories to turn redundant zig-zag paths from suboptimal logging policies into "shortcuts." These synthetic transitions are fed to a lightweight augmentor that replaces logging actions during data collection. Consequently, offline CQL significantly outperforms standard offline RL and warm-start SAC across various settings, including low-to-high dimensional and partial observation environments.
Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges: Treating known style preferences of LLM judges (verbosity, lists, emojis, etc.) as an attack surface that can be systematically exploited, the authors model the attack as a contextual bandit. Using LinUCB, they adaptively select from 8 semantic-preserving style rewriting actions within a 25-query budget, achieving an attack success rate of \(>65\%\) and score inflation of \(+1 \sim 2\) points (on a 9-point scale) across 5 mainstream judges while bypassing style control defenses.
Unified Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning: This paper reveals the "value misassignment" problem under heterogeneous cross-domain offline RL settings—where source data originates from multiple domains and policies, leading to inaccurate advantage evaluations that cause data filtering failure. The proposed V2A framework addresses value alignment and assignment issues through time-consistent modal representation learning and modal-aware advantage learning, outperforming DVDF by 21.4%.
Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards: The authors utilize "whether ground-level street views and satellite images can be localized to the same coordinates" as a verifiable indirect reward. Using GRPO, they perform two-stage post-training (CoT scaffolding + RL self-exploration) on Qwen2.5-VL-7B. This allows the model to learn general reasoning capabilities from GPS metadata alone, which generalize zero-shot to 25+ geospatial tasks.
Video-Based Optimal Transport for Feedback-Efficient Offline Preference-Based Reinforcement Learning: To address the high annotation cost in Preference-based RL (PbRL), which typically requires thousands of human comparisons, VOTP encodes trajectory segments into a semantic space using Video Foundation Models (ViFM). It then applies Optimal Transport (OT) to align a "small labeled set" with a "large unlabeled set" to propagate preferences and automatically generate pseudo-labels. With only 10 annotations, it learns effective rewards that outperform existing offline PbRL methods on D4RL locomotion and MetaWorld manipulation tasks, nearly matching Oracle performance.
Vulnerable Agent Identification in Large-Scale Multi-Agent Reinforcement Learning: This paper investigates the bi-level NP-hard problem of "identifying the \(K\) most vulnerable agents in a large-scale MARL system with \(N\) agents." It formalizes this as HAD-MFC (Hierarchical Adversarial Decentralized Mean Field Control). By applying the Fenchel-Rockafellar transformation, the training of the lower-level worst-case adversarial policy is collapsed into a regularized "robust mean-field Bellman operator." The upper-level combinatorial selection problem is then transformed into an MDP with dense rewards, solved via greedy search or RL. The decomposition is proved to maintain optimality, and the method outperforms baselines in 17 out of 18 tasks.
What Does Reinforcement Learning for Visual Tool Use Really Learn?: This paper proposes the MED framework to systematically analyze the actual learning effects of visual tool-use RL in crop-and-zoom scenarios—finding that the performance gains brought by RL training primarily stem from intrinsic capability improvement rather than enhanced tool mastery; the model mainly learns how to safely coexist with tools rather than truly mastering them.
You Can Learn Tokenization End-to-End with Reinforcement Learning: This paper models the decision of "where to draw token boundaries" in byte-level LLMs as a discrete stochastic process. By using a score function estimator equipped with early-exit relative rewards, time discounting, and batch-relative advantage, it achieves end-to-end learning of tokenization. The method outperforms straight-through estimators on a 147M natural language model and a 90M code model, approaching the performance of BPE-guided downsampling.
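To make the score function estimator in the tokenization entry above more concrete, here is a generic REINFORCE-style sketch for Bernoulli boundary decisions with a batch-mean baseline. It is not the paper's estimator: the early-exit relative rewards and time discounting are omitted, and the function name `score_function_grad`, the array shapes, and the scalar per-sequence reward are all assumptions made for illustration.

```python
import numpy as np


def score_function_grad(logits, boundaries, rewards):
    """Per-sample score-function gradient w.r.t. Bernoulli boundary logits.

    logits: (batch, seq_len) boundary logits, one per byte position.
    boundaries: (batch, seq_len) sampled 0/1 boundary decisions.
    rewards: (batch,) scalar reward per sequence (hypothetical, e.g. negative loss).
    """
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid: P(boundary = 1)
    advantages = rewards - rewards.mean()   # batch-relative advantage
    grad_logp = boundaries - probs          # d/dlogit of log Bernoulli(b | sigmoid(logit))
    return advantages[:, None] * grad_logp  # gradient-ascent direction


logits = np.zeros((2, 2))
boundaries = np.array([[1.0, 0.0], [0.0, 1.0]])
grad = score_function_grad(logits, boundaries, np.array([1.0, 0.0]))
grad_equal = score_function_grad(logits, boundaries, np.array([0.5, 0.5]))
```

Subtracting the batch mean means a sequence only pushes its sampled boundaries up when it did better than its peers; when all rewards in the batch are equal, the gradient vanishes.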